Abstract

We consider dynamic programming problems with a large time horizon, and give
sufficient conditions for the existence of the uniform value. As a consequence,
we obtain an existence result when the state space is precompact, payoffs are
uniformly continuous and the transition correspondence is non expansive. In the
same spirit, we give an existence result for the limit value. We also apply our
results to Markov decision processes and obtain a few generalizations of
existing results.

1 Introduction

We first and mainly consider deterministic dynamic programming problems with
infinite time horizon. We assume that payoffs are bounded and denote, for each
$n$, the value of the $n$-stage problem with average payoffs by $v_n$. By
definition, the problem has a limit value $v$ if $(v_n)$ converges to $v$. It has
a uniform value $v$ if $(v_n)$ converges to $v$, and for each $\varepsilon > 0$
there exists a play giving a payoff not lower than $v - \varepsilon$ in any
sufficiently long $n$-stage problem. So when the uniform value exists, a decision
maker can play $\varepsilon$-optimally simultaneously in any long enough problem.

In 1987, Mertens asked whether the uniform convergence of $(v_n)_n$ was enough
to imply the existence of the uniform value. Monderer and Sorin (1993), and
Lehrer and Monderer (1994) answered in the negative. In the context of zero-sum
stochastic games, Mertens and Neyman (1981) provided sufficient conditions,
of bounded variation type, on the discounted values to ensure the existence of
the uniform value. We give here new sufficient conditions for the existence of
this value.
We define, for every $m$ and $n$, the value $v_{m,n}$ as the supremum payoff the
decision maker can achieve when his payoff is defined as the average reward
computed between stages $m+1$ and $m+n$. We also define the value $w_{m,n}$ as the
supremum payoff the decision maker can achieve when his payoff is defined as the
minimum, for $t$ in $\{1, ..., n\}$, of his average rewards computed between stages
$m+1$ and $m+t$. We prove in theorem 3.7 that if the set
$W = \{w_{m,n}, m \ge 0, n \ge 1\}$, endowed with the supremum distance, is a
precompact metric space, then the uniform value $v^*$ exists, and we have the
equalities:

$$v^*(z) = \sup_{m \ge 0} \inf_{n \ge 1} w_{m,n}(z) = \sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z) = \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(z) = \inf_{n \ge 1} \sup_{m \ge 0} w_{m,n}(z).$$

In the same spirit, we also provide in theorem 3.10 a simple existence result for
the limit value: if the set $\{v_n, n \ge 1\}$, endowed with the supremum distance,
is precompact, then the limit value $v^*$ exists, and we have
$v^*(z) = \sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z) = \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(z)$.
These results, together with a few corollaries of theorem 3.7, are stated in
section 3.

Section 4 is devoted to the proofs of theorems 3.7 and 3.10. Section 5 contains
a counter-example to the existence of the uniform value, comments about
0-optimal plays, stationary $\varepsilon$-optimal plays, and discounted payoffs.
In particular, we show that the existence of the uniform value is slightly
stronger than: the existence of a limit for the discounted values, together with
the existence of $\varepsilon$-Blackwell optimal plays, i.e. plays which are
$\varepsilon$-optimal in any discounted problem with low enough discount factor
(see Rosenberg et al., 2002).

We finally consider in section 6 (probabilistic) Markov decision processes
(MDP hereafter) and show: 1) in a usual MDP with finite set of states and
arbitrary set of actions, the uniform value exists, and 2) if the decision maker
can randomly select his actions, the same result also holds when there is
imperfect observation of the state.

This work was motivated by the study of a particular class of repeated games
generalizing those introduced in Renault, 2006. Corollary 3.8 can also be used to
prove the existence of the uniform value in a specific class of stochastic games,
which leads to the existence of the value in general repeated games with an
informed controller. This is done in a companion paper (see Renault, 2007).
Finally, the ideas presented here may also be used in continuous time to study
some non expansive optimal control problems (see Quincampoix and Renault, 2009).

2 Model 

We consider a dynamic programming problem $(Z, F, r, z_0)$ where: $Z$ is a non
empty set, $F$ is a correspondence from $Z$ to $Z$ with non empty values, $r$ is a
mapping from $Z$ to $[0, 1]$, and $z_0 \in Z$.

$Z$ is called the set of states, $F$ is the transition correspondence, $r$ is the
reward (or payoff) function, and $z_0$ is called the initial state. The
interpretation is the following. The initial state is $z_0$, and a decision maker
(also called player) first has to select a new state $z_1$ in $F(z_0)$, and is
rewarded by $r(z_1)$. Then he has to choose $z_2$ in $F(z_1)$, has a payoff of
$r(z_2)$, etc. We have in mind a decision maker who is interested in maximizing
his "long-run average payoffs", i.e. quantities
$\frac{1}{t}(r(z_1) + r(z_2) + ... + r(z_t))$ for $t$ large. From now on we fix
$\Gamma = (Z, F, r)$, and for every state $z_0$ we denote by
$\Gamma(z_0) = (Z, F, r, z_0)$ the corresponding problem with initial state $z_0$.

For $z_0$ in $Z$, a play at $z_0$ is a sequence $s = (z_1, ..., z_t, ...) \in Z^\infty$
such that: $\forall t \ge 1$, $z_t \in F(z_{t-1})$. We denote by $S(z_0)$ the set of
plays at $z_0$, and by $S = \cup_{z_0 \in Z} S(z_0)$ the set of all plays. For
$n \ge 1$ and $s = (z_t)_{t \ge 1} \in S$, the average payoff of $s$ up to stage $n$
is defined by:

$$\gamma_n(s) = \frac{1}{n} \sum_{t=1}^{n} r(z_t).$$

And the $n$-stage value of $\Gamma(z_0)$ is:

$$v_n(z_0) = \sup_{s \in S(z_0)} \gamma_n(s).$$
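When the state space is finite, the n-stage value defined above can be computed by backward induction on the total reward. The following is only a minimal sketch: the three-state problem encoded in the dictionaries `F` and `r` is a hypothetical illustration and does not come from the paper.

```python
from functools import lru_cache

# Hypothetical finite problem (illustration only, not from the paper):
# F maps each state to its non-empty list of successors, r to a reward in [0, 1].
F = {"a": ["a", "b"], "b": ["c"], "c": ["c"]}
r = {"a": 0.0, "b": 0.5, "c": 1.0}


@lru_cache(maxsize=None)
def total(z, n):
    """Supremum of r(z_1) + ... + r(z_n) over plays (z_1, z_2, ...) at z."""
    if n == 0:
        return 0.0
    # The first selected state z_1 lies in F(z) and is the first one rewarded.
    return max(r[y] + total(y, n - 1) for y in F[z])


def v(z, n):
    """n-stage value v_n(z): best average payoff over the first n stages."""
    return total(z, n) / n


print([v("a", n) for n in (1, 2, 5, 50)])
```

In this toy problem the values v_n("a") increase with n because the player can reach the absorbing state "c", whose reward is 1, in two stages.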

Definition 2.1. Let $z$ be in $Z$.

The liminf value of $\Gamma(z)$ is $v^-(z) = \liminf_n v_n(z)$.
The limsup value of $\Gamma(z)$ is $v^+(z) = \limsup_n v_n(z)$.

We say that the decision maker can guarantee, or secure, the payoff $x$ in
$\Gamma(z)$ if there exists a play $s$ at $z$ such that $\liminf_n \gamma_n(s) \ge x$.
The lower long-run average value is defined by:

$$\underline{v}(z) = \sup\{x \in \mathbb{R}, \text{ the decision maker can guarantee } x \text{ in } \Gamma(z)\} = \sup_{s \in S(z)} \left( \liminf_n \gamma_n(s) \right).$$

Claim 2.2. $\underline{v}(z) \le v^-(z) \le v^+(z)$.

Definition 2.3.

The problem $\Gamma(z)$ has a limit value if $v^-(z) = v^+(z)$.
The problem $\Gamma(z)$ has a uniform value if $\underline{v}(z) = v^+(z)$.

When the limit value exists, we denote it by $v(z) = v^-(z) = v^+(z)$. For
$\varepsilon \ge 0$, a play $s$ in $S(z)$ such that
$\liminf_n \gamma_n(s) \ge v(z) - \varepsilon$ is then called an
$\varepsilon$-optimal play for $\Gamma(z)$.

On the one hand, the notion of limit value corresponds to the case where the
decision maker wants to maximize the quantities
$\frac{1}{t}(r(z_1) + r(z_2) + ... + r(z_t))$ for $t$ large and known. On the other
hand, the notion of uniform value is related to the case where the decision
maker is interested in maximizing his long-run average payoffs without knowing
the time horizon, i.e. quantities $\frac{1}{t}(r(z_1) + r(z_2) + ... + r(z_t))$
for $t$ large and unknown. We clearly have:

Claim 2.4. $\Gamma(z)$ has a uniform value if and only if $\Gamma(z)$ has a limit
value $v(z)$ and for every $\varepsilon > 0$ there exists an $\varepsilon$-optimal
play for $\Gamma(z)$.
Remark 2.5. The uniform value is related to the notion of average cost criterion
(see Arapostathis et al., 1993, or Hernandez-Lerma and Lasserre, 1996). For
example, a play $s$ in $S(z)$ is said to be "strong Average-Cost optimal in the
sense of Flynn" if $\lim_n (\gamma_n(s) - v_n(z)) = 0$. Notice that $(v_n(z))$ is
not assumed to converge here. A 0-optimal play for $\Gamma(z)$ satisfies this
optimality condition, but in general $\varepsilon$-optimal plays do not.

Remark 2.6. Discounted payoffs. 

Other types of evaluations are used. For $\lambda \in (0, 1]$, the
$\lambda$-discounted payoff of a play $s = (z_t)_t$ is defined by:
$\gamma_\lambda(s) = \sum_{t=1}^{\infty} \lambda (1 - \lambda)^{t-1} r(z_t)$. And the
$\lambda$-discounted value of $\Gamma(z)$ is
$v_\lambda(z) = \sup_{s \in S(z)} \gamma_\lambda(s)$.

An Abel mean can be written as an infinite convex combination of Cesaro
means, and it is possible to show that
$\limsup_{\lambda \to 0} v_\lambda(z) \le \limsup_{n \to \infty} v_n(z)$
(Lehrer and Sorin, 1992). One may have that $\lim_{\lambda \to 0} v_\lambda(z)$ and
$\lim_{n \to \infty} v_n(z)$ both exist and differ, however it is known that the
uniform convergence of $(v_\lambda)_\lambda$ is equivalent to the uniform
convergence of $(v_n)_n$, and whenever this type of convergence holds the limits
are necessarily the same (Lehrer and Sorin, 1992).
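To make the two evaluations concrete, the sketch below compares the Cesaro mean over n stages with the discounted (Abel) mean along a single play. The reward stream (1 at odd stages, 0 at even stages) and the truncation rule are assumptions chosen for illustration; on this stream both means tend to 1/2.

```python
def reward(t):
    """Hypothetical reward stream r(z_t): 1 at odd stages, 0 at even stages."""
    return 1.0 if t % 2 == 1 else 0.0


def cesaro(n):
    """Average payoff of the play over its first n stages."""
    return sum(reward(t) for t in range(1, n + 1)) / n


def abel(lam):
    """lam-discounted payoff of the play, truncated after about 40 / lam stages.

    Rewards lie in [0, 1], so the neglected tail is at most (1 - lam) ** horizon,
    which is of the order of exp(-40).
    """
    horizon = int(40 / lam) + 1
    return sum(lam * (1 - lam) ** (t - 1) * reward(t) for t in range(1, horizon + 1))


for n, lam in [(10, 0.1), (100, 0.01), (1000, 0.001)]:
    print(n, cesaro(n), lam, abel(lam))
```

For this stream the discounted payoff has the closed form 1 / (2 - lam), which can be used to check the truncated sum.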

A play $s$ at $z_0$ is said to be Blackwell optimal in $\Gamma(z_0)$ if there
exists $\lambda_0 > 0$ such that for all $\lambda \in (0, \lambda_0]$,
$\gamma_\lambda(s) = v_\lambda(z_0)$. Blackwell optimality has been extensively
studied after the seminal work of Blackwell (1962), who proved the existence of
such plays in the context of MDP with finite sets of states and actions (see
subsection 6.1). A survey can be found in Hordijk and Yushkevich, 2002.

In general Blackwell optimal plays do not exist, and a play $s$ at $z_0$ is said
to be $\varepsilon$-Blackwell optimal in $\Gamma(z_0)$ if there exists
$\lambda_0 > 0$ such that for all $\lambda \in (0, \lambda_0]$,
$\gamma_\lambda(s) \ge v_\lambda(z_0) - \varepsilon$. We will prove at the end of
section 5 that: 1) if $\Gamma(z)$ has a uniform value $v(z)$, then
$(v_\lambda(z))_\lambda$ converges to $v(z)$, and $\varepsilon$-Blackwell optimal
plays exist for each positive $\varepsilon$. And 2) the converse is false.
Consequently, the notion of uniform value is (slightly) stronger than the
existence of a limit for $v_\lambda$ and $\varepsilon$-Blackwell optimal plays.

3 Main results 

We will give in the sequel sufficient conditions for the existence of the uniform 
value. We start with general notations and lemmas. 

Definition 3.1. For $s = (z_t)_{t \ge 1}$ in $S$, $m \ge 0$ and $n \ge 1$, we set:

$$\gamma_{m,n}(s) = \frac{1}{n} \sum_{t=1}^{n} r(z_{m+t}) \quad \text{and} \quad \nu_{m,n}(s) = \min\{\gamma_{m,t}(s), t \in \{1, ..., n\}\}.$$

We have $\nu_{m,n}(s) \le \gamma_{m,n}(s)$, and $\gamma_{0,n}(s) = \gamma_n(s)$. We
write $\nu_n(s) = \nu_{0,n}(s) = \min\{\gamma_t(s), t \in \{1, ..., n\}\}$.

Definition 3.2. For $z$ in $Z$, $m \ge 0$, and $n \ge 1$, we set:

$$v_{m,n}(z) = \sup_{s \in S(z)} \gamma_{m,n}(s) \quad \text{and} \quad w_{m,n}(z) = \sup_{s \in S(z)} \nu_{m,n}(s).$$

We have $v_{0,n}(z) = v_n(z)$, and we also set $w_n(z) = w_{0,n}(z)$. $v_{m,n}$
corresponds to the case where the decision maker first makes $m$ moves in order
to reach a "good initial state", then plays $n$ moves for payoffs. $w_{m,n}$
corresponds to the case where the decision maker first makes $m$ moves in order
to reach a "good initial state", but then his payoff only is the minimum of his
next $n$ average rewards (as if some adversary trying to minimize the rewards
was then able to choose the length of the remaining game). This has to be
related to the notion of uniform value, which requires the existence of plays
giving high payoffs for any (large enough) length of the game. Of course we have
$w_{m,n+1} \le w_{m,n} \le v_{m,n}$ and, since $r$ takes values in $[0, 1]$,

$$n v_n \le (m+n) v_{m+n} \le n v_n + m \quad \text{and} \quad n v_{m,n} \le (m+n) v_{m+n} \le n v_{m,n} + m. \tag{1}$$
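On a small finite problem, the quantities of definitions 3.1 and 3.2 can be computed by brute-force enumeration of finite plays. The sketch below is illustrative only: the three-state problem in `F` and `r` is hypothetical, and exact rational arithmetic is used so that the monotonicity property and inequalities (1) can be checked without rounding issues.

```python
from fractions import Fraction

# Hypothetical finite problem (illustration only, not from the paper).
F = {"a": ["a", "b"], "b": ["c"], "c": ["a"]}
r = {"a": Fraction(0), "b": Fraction(1), "c": Fraction(1, 2)}


def plays(z, length):
    """Yield every finite play (z_1, ..., z_length) at z."""
    if length == 0:
        yield ()
        return
    for y in F[z]:
        for tail in plays(y, length - 1):
            yield (y,) + tail


def gamma(s, m, n):
    """Average reward of s over stages m+1, ..., m+n (s[t-1] is the state z_t)."""
    return sum(r[s[t - 1]] for t in range(m + 1, m + n + 1)) / n


def nu(s, m, n):
    """Minimum over t in {1, ..., n} of the average reward over stages m+1, ..., m+t."""
    return min(gamma(s, m, t) for t in range(1, n + 1))


def v_mn(z, m, n):
    """Supremum over plays at z of gamma(s, m, n)."""
    return max(gamma(s, m, n) for s in plays(z, m + n))


def w_mn(z, m, n):
    """Supremum over plays at z of nu(s, m, n)."""
    return max(nu(s, m, n) for s in plays(z, m + n))


print(v_mn("a", 1, 2), w_mn("a", 1, 2))
```

The enumeration is exponential in m + n, so this is only meant for checking small cases by hand.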

We start with a few lemmas, which are true without assumption on the problem.
We first show that whenever the limit value exists it has to be
$\sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z)$.

Lemma 3.3. $\forall z \in Z$,

$$v^-(z) = \sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z).$$

Proof: For every $m$ and $n$, we have $v_{m,n}(z) \le (1 + m/n) v_{m+n}(z)$, so for
each $m$ we get: $\inf_{n \ge 1} v_{m,n}(z) \le v^-(z)$. Consequently,
$\sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z) \le v^-(z)$, and it remains to show that
$\sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z) \ge v^-(z)$. Assume for contradiction
that there exists $\varepsilon > 0$ such that for each $m \ge 0$, one can find
$n(m) \ge 1$ satisfying $v_{m,n(m)}(z) \le v^-(z) - \varepsilon$. Define now
$m_1 = n(0)$, and set by induction $m_{k+1} = n(m_1 + ... + m_k)$ for each
$k \ge 1$. For each $k$, we have
$v_{m_1 + ... + m_{k-1}, m_k}(z) \le v^-(z) - \varepsilon$, and also:

$$(m_1 + ... + m_k) v_{m_1 + ... + m_k}(z) \le m_1 v_{m_1}(z) + m_2 v_{m_1, m_2}(z) + ... + m_k v_{m_1 + ... + m_{k-1}, m_k}(z).$$

This implies $v_{m_1 + ... + m_k}(z) \le v^-(z) - \varepsilon$. Since
$\lim_k (m_1 + ... + m_k) = +\infty$, we obtain a contradiction with the
definition of $v^-(z)$. □

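The next lemmas show that the quantities $w_{m,n}$ are not that low.

Lemma 3.4. $\forall k \ge 1$, $\forall n \ge 1$, $\forall m \ge 0$, $\forall z \in Z$,

$$v_{m,n}(z) \le \sup_{l \ge 0} w_{l,k}(z) + \frac{k-1}{n}.$$

Proof: Fix $k$, $n$, $m$ and $z$. Set $A = \sup_{l \ge 0} w_{l,k}(z)$, and consider
$\varepsilon > 0$.

By definition of $v_{m,n}(z)$, there exists a play $s$ at $z$ such that
$\gamma_{m,n}(s) \ge v_{m,n}(z) - \varepsilon$. For any $i \ge m$, we have that:
$\min\{\gamma_{i,t}(s), t \in \{1, ..., k\}\} = \nu_{i,k}(s) \le w_{i,k}(z) \le A$.
So we know that for every $i \ge m$, there exists $t(i) \in \{1, ..., k\}$ s.t.
$\gamma_{i,t(i)}(s) \le A$.

Define now by induction $i_1 = m$, $i_2 = i_1 + t(i_1)$, ...,
$i_q = i_{q-1} + t(i_{q-1})$, where $q$ is such that
$i_q \le m + n < i_q + t(i_q)$. We have

$$n \gamma_{m,n}(s) \le \sum_{p=1}^{q-1} t(i_p) A + (m + n - i_q) \cdot 1 \le nA + k - 1,$$

so $\gamma_{m,n}(s) \le A + \frac{k-1}{n}$. Hence
$v_{m,n}(z) \le A + \frac{k-1}{n} + \varepsilon$, and $\varepsilon$ is arbitrary. □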
Lemma 3.5. For every state $z$ in $Z$,

$$v^+(z) \le \inf_{n \ge 1} \sup_{m \ge 0} w_{m,n}(z) = \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(z).$$

Proof of lemma 3.5: Using lemma 3.4 with $m = 0$ and arbitrary positive $k$, we
can obtain $\limsup_n v_n(z) \le \sup_{l \ge 0} w_{l,k}(z)$. So
$v^+(z) \le \inf_{n \ge 1} \sup_{m \ge 0} w_{m,n}(z)$. We always have
$w_{m,n}(z) \le v_{m,n}(z)$, so clearly
$\inf_{n \ge 1} \sup_{m \ge 0} w_{m,n}(z) \le \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(z)$.
Finally, lemma 3.4 gives: $\forall k \ge 1$, $\forall n \ge 1$, $\forall m \ge 0$,
$v_{m,nk}(z) \le \sup_{l \ge 0} w_{l,k}(z) + \frac{1}{n}$, so
$\sup_m v_{m,nk}(z) \le \sup_{l \ge 0} w_{l,k}(z) + \frac{1}{n}$. So
$\inf_n \sup_m v_{m,n}(z) \le \inf_n \sup_m v_{m,nk}(z) \le \sup_{l \ge 0} w_{l,k}(z)$,
and this holds for every positive $k$. □

Definition 3.6. We define $W = \{w_{m,n}, m \ge 0, n \ge 1\}$, and for each $z$ in $Z$:

$$v^*(z) = \inf_{n \ge 1} \sup_{m \ge 0} w_{m,n}(z) = \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(z).$$

$W$ will always be endowed with the uniform distance
$d_\infty(w, w') = \sup\{|w(z) - w'(z)|, z \in Z\}$, so $W$ is a metric space. Due to
lemma 3.3 and lemma 3.5 we have the following chain of inequalities:

$$\sup_{m \ge 0} \inf_{n \ge 1} w_{m,n}(z) \le \sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z) = v^-(z) \le v^+(z) \le v^*(z). \tag{2}$$

One may have
$\sup_{m \ge 0} \inf_{n \ge 1} w_{m,n}(z) < \sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z)$,
as example 5.1 will show later. Regarding the existence of the uniform value, the
most general result of this paper is the following (see the acknowledgements at
the end).

Theorem 3.7. Let $Z$ be a non empty set, $F$ be a correspondence from $Z$ to $Z$
with non empty values, and $r$ be a mapping from $Z$ to $[0, 1]$.

Assume that $W$ is precompact. Then for every initial state $z$ in $Z$, the problem
$\Gamma(z) = (Z, F, r, z)$ has a uniform value which is:

$$v^*(z) = \underline{v}(z) = v^+(z) = v^-(z) = \sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z) = \sup_{m \ge 0} \inf_{n \ge 1} w_{m,n}(z).$$

And the sequence $(v_n)_n$ uniformly converges to $v^*$.

If the state space $Z$ is precompact and the family $(w_{m,n})_{m \ge 0, n \ge 1}$ is
uniformly equicontinuous, then by Ascoli's theorem we obtain that $W$ is
precompact. So a corollary of theorem 3.7 is the following:

Corollary 3.8. Let $Z$ be a non empty set, $F$ be a correspondence from $Z$ to $Z$
with non empty values, and $r$ be a mapping from $Z$ to $[0, 1]$.

Assume that $Z$ is endowed with a distance $d$ such that: a) $(Z, d)$ is a
precompact metric space, and b) the family $(w_{m,n})_{m \ge 0, n \ge 1}$ is
uniformly equicontinuous. Then we have the same conclusions as theorem 3.7.

Notice that if $Z$ is finite, we can consider $d$ such that $d(z, z') = 1$ if
$z \ne z'$, so corollary 3.8 gives the well known result: in the finite case, the
uniform value exists. As the hypotheses of theorem 3.7 and corollary 3.8 depend
on the auxiliary functions $(w_{m,n})$, we now present an existence result with
hypotheses directly expressed in terms of the basic data $(Z, F, r)$.

Corollary 3.9. Let $Z$ be a non empty set, $F$ be a correspondence from $Z$ to $Z$
with non empty values, and $r$ be a mapping from $Z$ to $[0, 1]$.

Assume that $Z$ is endowed with a distance $d$ such that: a) $(Z, d)$ is a
precompact metric space, b) $r$ is uniformly continuous, and c) $F$ is non
expansive, i.e. $\forall z \in Z$, $\forall z' \in Z$, $\forall z_1 \in F(z)$,
$\exists z_1' \in F(z')$ s.t. $d(z_1, z_1') \le d(z, z')$. Then we have the same
conclusions as theorem 3.7.

Suppose for example that $F$ has compact values, and use the Hausdorff distance
between compact subsets of $Z$:
$d(A, B) = \max\{\sup_{a \in A} d(a, B), \sup_{b \in B} d(A, b)\}$.
Then $F$ is non expansive if and only if it is 1-Lipschitz:
$d(F(z), F(z')) \le d(z, z')$ for all $(z, z')$ in $Z^2$.
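On a finite metric space, the non-expansiveness condition c) can be checked directly from its definition. The sketch below is only an illustration under assumed data: the states are the integers 0 to 4 with the usual distance, and the two correspondences `shift` and `double` are hypothetical examples.

```python
def is_non_expansive(states, F, d):
    """Check that for all z, y and z1 in F[z] there is y1 in F[y] with d(z1, y1) <= d(z, y)."""
    return all(
        any(d(z1, y1) <= d(z, y) for y1 in F[y])
        for z in states
        for y in states
        for z1 in F[z]
    )


def dist(a, b):
    """Usual distance on the integers."""
    return abs(a - b)


states = range(5)
# Hypothetical correspondences on {0, ..., 4} (illustration only).
shift = {z: [min(z + 1, 4)] for z in states}   # translation, capped at 4
double = {z: [min(2 * z, 4)] for z in states}  # doubling, capped at 4

print(is_non_expansive(states, shift, dist))
print(is_non_expansive(states, double, dist))
```

The capped translation never increases distances, whereas the doubling map sends the pair (1, 2) to (2, 4) and therefore fails the condition.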

Proof of corollary 3.9: Assume that a), b), and c) are satisfied.

Consider $z$ and $z'$ in $Z$, and a play $s = (z_t)_{t \ge 1}$ in $S(z)$. We have
$z_1 \in F(z)$, and $F$ is non expansive, so there exists $z_1' \in F(z')$ such
that $d(z_1, z_1') \le d(z, z')$. It is easy to construct inductively a play
$(z_t')_t$ in $S(z')$ such that for each $t$, $d(z_t, z_t') \le d(z, z')$.
Consequently:

$$\forall (z, z') \in Z^2, \forall s = (z_t)_{t \ge 1} \in S(z), \exists s' = (z_t')_{t \ge 1} \in S(z') \text{ s.t. } \forall t \ge 1, d(z_t, z_t') \le d(z, z').$$

We now consider payoffs. Define the modulus of continuity $\hat{\varepsilon}$ of
$r$ by
$\hat{\varepsilon}(\alpha) = \sup_{z, z' \text{ s.t. } d(z, z') \le \alpha} |r(z) - r(z')|$
for each $\alpha \ge 0$. So $|r(z) - r(z')| \le \hat{\varepsilon}(d(z, z'))$ for
each pair of states $z$, $z'$, and $\hat{\varepsilon}$ is continuous at 0. Using
the previous construction, we obtain that for $z$ and $z'$ in $Z$,
$\forall m \ge 0$, $\forall n \ge 1$,
$|v_{m,n}(z) - v_{m,n}(z')| \le \hat{\varepsilon}(d(z, z'))$ and
$|w_{m,n}(z) - w_{m,n}(z')| \le \hat{\varepsilon}(d(z, z'))$. In particular, the
family $(w_{m,n})_{m \ge 0, n \ge 1}$ is uniformly equicontinuous, and corollary
3.8 gives the result. □

We now provide an existence result for the limit value. 

Theorem 3.10. Let $Z$ be a non empty set, $F$ be a correspondence from $Z$ to $Z$
with non empty values, and $r$ be a mapping from $Z$ to $[0, 1]$.

Assume that the set $V = \{v_n, n \ge 1\}$, endowed with the uniform distance,
is a precompact metric space. Then for every initial state $z$ in $Z$, the problem
$\Gamma(z) = (Z, F, r, z)$ has a limit value which is:

$$v^*(z) = \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(z) = \sup_{m \ge 0} \inf_{n \ge 1} v_{m,n}(z).$$

And the sequence $(v_n)_n$ uniformly converges to $v^*$.

In particular, we obtain that the uniform convergence of $(v_n)_n$ is equivalent
to the precompactness of $V$. And if $(v_n)_n$ uniformly converges, then the limit
has to be $v^*$. Notice that this does not imply the existence of the uniform
value, as shown by the counter-examples in Monderer and Sorin (1993) and Lehrer
and Monderer (1994).



4 Proof of theorems 3.7 and 3.10

4.1 Proof of theorem 3.7

We assume that $W$ is precompact, and prove here theorem 3.7. The proof is
made in five steps.



Step 1. Viewing $Z$ as a precompact pseudometric space.

Define $d(z, z') = \sup_{m,n} |w_{m,n}(z) - w_{m,n}(z')|$ for all $z$, $z'$ in $Z$.
$(Z, d)$ is a pseudometric space (hence may not be Hausdorff). Fix
$\varepsilon > 0$. By assumption on $W$ there exists a finite subset $I$ of
indexes such that: $\forall m \ge 0$, $\forall n \ge 1$, $\exists i \in I$ s.t.
$d_\infty(w_{m,n}, w_i) \le \varepsilon$. Since $\{(w_i(z))_{i \in I}, z \in Z\}$ is
included in the compact metric space $([0, 1]^I$, uniform distance$)$, we obtain
the existence of a finite subset $C$ of $Z$ such that: $\forall z \in Z$,
$\exists c \in C$ s.t. $\forall i \in I$, $|w_i(z) - w_i(c)| \le \varepsilon$. We
obtain:

For each $\varepsilon > 0$, there exists a finite subset $C$ of $Z$ s.t.:
$\forall z \in Z$, $\exists c \in C$, $d(z, c) \le \varepsilon$.

Equivalently, every sequence in $Z$ admits a Cauchy subsequence for $d$.

In the sequel of subsection 4.1, $Z$ will always be endowed with the
pseudometric $d$. It is plain that every value function $w_{m,n}$ is now
1-Lipschitz. Since $v^*(z) = \inf_{n \ge 1} \sup_{m \ge 0} w_{m,n}(z)$, the mapping
$v^*$ also is 1-Lipschitz.

Step 2. Iterating $F$.

We define inductively a sequence of correspondences $(F^n)_n$ from $Z$ to $Z$, by
$F^0(z) = \{z\}$ for every state $z$, and $\forall n \ge 0$,
$F^{n+1} = F^n \circ F$ (where the composition is defined by
$G \circ H(z) = \{z'' \in Z, \exists z' \in H(z), z'' \in G(z')\}$). $F^n(z)$
represents the set of states that the decision maker can reach in $n$ stages
from the initial state $z$. It is easily shown by induction on $m$ that:

$$\forall m \ge 0, \forall n \ge 1, \forall z \in Z, \quad w_{m,n}(z) = \sup_{y \in F^m(z)} w_n(y). \tag{3}$$

We also define, for every initial state $z$: $G^m(z) = \bigcup_{n=0}^{m} F^n(z)$ and
$G^\infty(z) = \bigcup_{n=0}^{\infty} F^n(z)$. The set $G^\infty(z)$ is the set of
states that the decision maker, starting from $z$, can reach in a finite number
of stages. Since $(Z, d)$ is precompact pseudometric, we can obtain the
convergence of $G^m(z)$ to $G^\infty(z)$:

$$\forall \varepsilon > 0, \forall z \in Z, \exists m \ge 0, \forall x \in G^\infty(z), \exists y \in G^m(z) \text{ s.t. } d(x, y) \le \varepsilon. \tag{4}$$

(Suppose on the contrary that there exists $\varepsilon$, $z$, and a sequence
$(z_m)_m$ of points in $G^\infty(z)$ such that the distance $d(z_m, G^m(z))$ is at
least $\varepsilon$ for each $m$. Then by considering a Cauchy subsequence
$(z_{\varphi(m)})_m$, one can find $m_0$ such that for all $m \ge m_0$,
$d(z_{\varphi(m)}, z_{\varphi(m_0)}) \le \varepsilon/2$. Let now $k$ be such that
$z_{\varphi(m_0)} \in G^k(z)$, we have for every $m \ge \max(k, m_0)$:
$\varepsilon/2 \ge d(z_{\varphi(m)}, z_{\varphi(m_0)}) \ge d(z_{\varphi(m)}, G^k(z)) \ge d(z_{\varphi(m)}, G^{\varphi(m)}(z)) \ge \varepsilon$.
Hence a contradiction.)
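For a finite state space, the iterated correspondences and the reachable sets used in this step are straightforward to compute. The sketch below is illustrative only; the four-state correspondence `F` is a hypothetical example, and `iterate` and `reach_within` are names introduced here for the sets reachable in exactly n stages and in at most m stages.

```python
def iterate(F, z, n):
    """States reachable from z in exactly n stages (the n-th iterate of F at z)."""
    current = {z}
    for _ in range(n):
        current = {y for x in current for y in F[x]}
    return current


def reach_within(F, z, m):
    """States reachable from z in at most m stages (union of the iterates 0, ..., m)."""
    reached, current = {z}, {z}
    for _ in range(m):
        current = {y for x in current for y in F[x]}
        reached |= current
    return reached


# Hypothetical correspondence on four states (illustration only).
F = {0: [1], 1: [2], 2: [2, 0], 3: [0]}

print(iterate(F, 0, 3))
print(reach_within(F, 0, 2))
```

In a finite problem the sets reachable within m stages stabilize after finitely many steps, which is the trivial case of the convergence property (4); here state 3 is never reached from state 0.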

Step 3. Convergence of $(v_n(z))_n$ to $v^*(z)$.

3.a. Here we will show that:

$$\forall \varepsilon > 0, \forall z \in Z, \exists M \ge 0, \forall n \ge 1, \exists m \le M \text{ s.t. } w_{m,n}(z) \ge v^*(z) - \varepsilon. \tag{5}$$

Fix $\varepsilon > 0$ and $z$ in $Z$. By (4) there exists $M$ such that:
$\forall x \in G^\infty(z)$, $\exists y \in G^M(z)$ s.t. $d(x, y) \le \varepsilon$.
For each positive $n$, by definition of $v^*$ there exists $m(n)$ such that
$w_{m(n),n}(z) \ge v^*(z) - \varepsilon$. So by equation (3), one can find $y_n$ in
$G^{m(n)}(z)$ s.t. $w_n(y_n) \ge v^*(z) - 2\varepsilon$. By definition of $M$,
there exists $y_n'$ in $G^M(z)$ such that $d(y_n, y_n') \le \varepsilon$. And
$w_n(y_n') \ge w_n(y_n) - \varepsilon \ge v^*(z) - 3\varepsilon$. This proves (5).

3.b. Fix $\varepsilon > 0$ and $z$ in $Z$, and consider $M \ge 0$ given by (5).
Consider some $m$ in $\{0, ..., M\}$ such that:
$w_{m,n}(z) \ge v^*(z) - \varepsilon$ is true for infinitely many $n$'s. Since
$w_{m,n+1} \le w_{m,n}$, the inequality $w_{m,n}(z) \ge v^*(z) - \varepsilon$ is
true for every $n$. We have improved step 3.a. and obtained:

$$\forall \varepsilon > 0, \forall z \in Z, \exists m \ge 0, \forall n \ge 1, \quad w_{m,n}(z) \ge v^*(z) - \varepsilon. \tag{6}$$

Consequently, $\forall z \in Z$, $\forall \varepsilon > 0$,
$\sup_m \inf_n w_{m,n}(z) \ge v^*(z) - \varepsilon$. So for every initial state
$z$, $\sup_m \inf_n w_{m,n}(z) \ge v^*(z)$, and inequalities (2) give:

$$\sup_m \inf_n w_{m,n}(z) = \sup_m \inf_n v_{m,n}(z) = v^-(z) = v^+(z) = v^*(z).$$

And $(v_n(z))_n$ converges to $v^*(z)$.

Step 4. Uniform convergence of $(v_n)_n$.

4.a. Write, for each state $z$ and $n \ge 1$: $f_n(z) = \sup_{m \ge 0} w_{m,n}(z)$.
The sequence $(f_n)_n$ is non increasing and converges pointwise to $v^*$. Each
$f_n$ is 1-Lipschitz and $Z$ is pseudometric precompact, so the convergence is
uniform. As a consequence we get:

$$\forall \varepsilon > 0, \exists n_0, \forall z \in Z, \quad \sup_{m \ge 0} w_{m,n_0}(z) \le v^*(z) + \varepsilon.$$

By lemma 3.4, we obtain:

$$\forall \varepsilon > 0, \exists n_0, \forall z \in Z, \forall m \ge 0, \forall n \ge 1, \quad v_{m,n}(z) \le v^*(z) + \varepsilon + \frac{n_0 - 1}{n}.$$

Considering $n_1 \ge n_0/\varepsilon$ gives:

$$\forall \varepsilon > 0, \exists n_1, \forall z \in Z, \forall n \ge n_1, \quad v_n(z) \le \sup_{m \ge 0} v_{m,n}(z) \le v^*(z) + 2\varepsilon. \tag{7}$$

4.b. Write now, for each state $z$ and $m \ge 0$:
$g_m(z) = \sup_{m' \le m} \inf_{n \ge 1} w_{m',n}(z)$. $(g_m)_m$ is non decreasing
and converges pointwise to $v^*$. As in 4.a., we can obtain that $(g_m)_m$
uniformly converges. Consequently,

$$\forall \varepsilon > 0, \exists M \ge 0, \forall z \in Z, \exists m \le M, \quad \inf_{n \ge 1} w_{m,n}(z) \ge v^*(z) - \varepsilon. \tag{8}$$

Fix $\varepsilon > 0$, and consider $M$ given above. Consider
$N \ge M/\varepsilon$. Then $\forall z \in Z$, $\forall n \ge N$,
$\exists m \le M$ s.t. $w_{m,n}(z) \ge v^*(z) - \varepsilon$. But
$v_n(z) \ge v_{m,n}(z) - m/n$ by (1), so we obtain
$v_n(z) \ge v_{m,n}(z) - \varepsilon \ge v^*(z) - 2\varepsilon$. We have shown:

$$\forall \varepsilon > 0, \exists N, \forall z \in Z, \forall n \ge N, \quad v_n(z) \ge v^*(z) - 2\varepsilon. \tag{9}$$

By (7) and (9), the convergence of $(v_n)_n$ is uniform.

Step 5. Uniform value.

By claim 2.4, in order to prove that $\Gamma(z)$ has a uniform value it remains
to show that $\varepsilon$-optimal plays exist for every $\varepsilon > 0$. We
start with a lemma.

Lemma 4.1. $\forall \varepsilon > 0$, $\exists M \ge 0$, $\exists K \ge 1$,
$\forall z \in Z$, $\exists m \le M$, $\forall n \ge K$,
$\exists s = (z_t)_{t \ge 1} \in S(z)$ such that:

$$\nu_{m,n}(s) \ge v^*(z) - \varepsilon/2, \quad \text{and} \quad v^*(z_{m+n}) \ge v^*(z) - \varepsilon.$$

This lemma has the same flavor as Proposition 2 in Rosenberg et al. (2002),
and Proposition 2 in Lehrer and Sorin (1992). If we want to construct
$\varepsilon$-optimal plays, for every large $n$ we have to construct a play
which: 1) gives good average payoffs if one stops the play at any large stage
before $n$, and 2) after $n$ stages, leaves the player with a good "target"
payoff. This explains the importance of the quantities $\nu_{m,n}$ which have
led to the definition of the mappings $w_{m,n}$.

Proof of lemma 4.1: Fix $\varepsilon > 0$. Take $M$ given by property (8). Take
$K$ given by (7) such that: $\forall z \in Z$, $\forall n \ge K$,
$v_n(z) \le \sup_m v_{m,n}(z) \le v^*(z) + \varepsilon$.

Fix an initial state $z$ in $Z$. Consider $m$ given by (8), and $n \ge K$. We have
to find $s = (z_t)_{t \ge 1} \in S(z)$ such that, up to a multiplicative constant
in front of $\varepsilon$, $\nu_{m,n}(s) \ge v^*(z) - \varepsilon/2$ and
$v^*(z_{m+n}) \ge v^*(z) - \varepsilon$.

We have $w_{m,n'}(z) \ge v^*(z) - \varepsilon$ for every $n' \ge 1$, so
$w_{m,2n}(z) \ge v^*(z) - \varepsilon$, and we consider
$s = (z_1, ..., z_t, ...) \in S(z)$ which is $\varepsilon$-optimal for
$w_{m,2n}(z)$, in the sense that $\nu_{m,2n}(s) \ge w_{m,2n}(z) - \varepsilon$.
We have:

$$\nu_{m,n}(s) \ge \nu_{m,2n}(s) \ge w_{m,2n}(z) - \varepsilon \ge v^*(z) - 2\varepsilon.$$

Write: $X = \gamma_{m,n}(s)$ and $Y = \gamma_{m+n,n}(s)$, the average payoffs of
$s$ on the two consecutive blocks of $n$ stages following stage $m$.

Since $\nu_{m,2n}(s) \ge v^*(z) - 2\varepsilon$, we have
$X \ge v^*(z) - 2\varepsilon$, and
$(X + Y)/2 = \gamma_{m,2n}(s) \ge v^*(z) - 2\varepsilon$. Since $n \ge K$, we also
have $X \le v_{m,n}(z) \le v^*(z) + \varepsilon$. And $n \ge K$ also gives
$v_n(z_{m+n}) \le v^*(z_{m+n}) + \varepsilon$, so
$v^*(z_{m+n}) \ge v_n(z_{m+n}) - \varepsilon \ge Y - \varepsilon$. We write now
$Y/2 = (X + Y)/2 - X/2$ and obtain $Y/2 \ge (v^*(z) - 5\varepsilon)/2$. So
$Y \ge v^*(z) - 5\varepsilon$, and finally
$v^*(z_{m+n}) \ge v^*(z) - 6\varepsilon$. Applying the argument with
$\varepsilon/6$ in place of $\varepsilon$ gives the lemma. □



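Proposition 4.2. For every state $z$ and $\varepsilon > 0$ there exists an
$\varepsilon$-optimal play in $\Gamma(z)$.

Proof: Fix $\alpha > 0$.

For every $i \ge 1$, set $\varepsilon_i = \frac{\alpha}{2^i}$. Define
$M_i = M(\varepsilon_i)$ and $K_i = K(\varepsilon_i)$ given by lemma 4.1 for
$\varepsilon_i$. Define also $n_i$ as the integer part of
$1 + \max\{K_i, \frac{M_{i+1}}{\alpha}\}$, so that simply $n_i \ge K_i$ and
$n_i \ge \frac{M_{i+1}}{\alpha}$.

We have: $\forall i \ge 1$, $\forall z \in Z$, $\exists m(z, i) \le M_i$,
$\exists s = (z_t)_{t \ge 1} \in S(z)$, s.t.

$$\nu_{m(z,i), n_i}(s) \ge v^*(z) - \frac{\alpha}{2^{i+1}} \quad \text{and} \quad v^*(z_{m(z,i) + n_i}) \ge v^*(z) - \frac{\alpha}{2^i}.$$

We now fix the initial state $z$ in $Z$, and for simplicity write $v^*$ for
$v^*(z)$. If $\alpha \ge v^*$ it is clear that $\alpha$-optimal plays at
$\Gamma(z)$ exist, so we assume $v^* - \alpha > 0$. We define a sequence
$(z^i, m_i, s^i)_{i \ge 1}$ by induction:

• first put $z^1 = z$, $m_1 = m(z^1, 1) \le M_1$, and pick
$s^1 = (z_t^1)_{t \ge 1}$ in $S(z^1)$ such that
$\nu_{m_1, n_1}(s^1) \ge v^*(z^1) - \frac{\alpha}{2^2}$, and
$v^*(z^1_{m_1 + n_1}) \ge v^*(z^1) - \frac{\alpha}{2}$.

• for $i \ge 2$, put $z^i = z^{i-1}_{m_{i-1} + n_{i-1}}$,
$m_i = m(z^i, i) \le M_i$, and pick $s^i = (z_t^i)_{t \ge 1} \in S(z^i)$ such that
$\nu_{m_i, n_i}(s^i) \ge v^*(z^i) - \frac{\alpha}{2^{i+1}}$ and
$v^*(z^i_{m_i + n_i}) \ge v^*(z^i) - \frac{\alpha}{2^i}$.

Consider finally
$s = (z_1^1, ..., z^1_{m_1 + n_1}, z_1^2, ..., z^2_{m_2 + n_2}, ..., z_1^i, ..., z^i_{m_i + n_i}, z_1^{i+1}, ...)$.
$s$ is defined by blocks: first $s^1$ is followed for $m_1 + n_1$ stages, then
$s^2$ is followed for $m_2 + n_2$ stages, etc. Since
$z^i = z^{i-1}_{m_{i-1} + n_{i-1}}$ for each $i$, $s$ is a play at $z$. For each $i$
we have $n_i \ge M_{i+1}/\alpha \ge m_{i+1}/\alpha$, so the "$n_i$ subblock" is
much longer than the "$m_{i+1}$ subblock".

[Figure: the play $s$ as a concatenation of blocks $s^1, ..., s^i, ...$; block
$s^i$ consists of $m_i$ stages followed by $n_i$ stages.]

For each $i \ge 2$, we have
$v^*(z^i) \ge v^*(z^{i-1}) - \frac{\alpha}{2^{i-1}}$. So
$v^*(z^i) \ge -\frac{\alpha}{2^{i-1}} - \frac{\alpha}{2^{i-2}} - ... - \frac{\alpha}{2} + v^*(z^1) \ge v^* - \alpha + \frac{\alpha}{2^{i-1}}$.
So $\nu_{m_i, n_i}(s^i) \ge v^* - \alpha$.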

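Let now $T$ be large, and write $s_t$ for the state of the play $s$ at stage $t$.

First assume that $T = m_1 + n_1 + ... + m_{i-1} + n_{i-1} + r$, for some positive
$i$ and $r$ in $\{0, ..., m_i\}$ (the integer $r$ is a remainder here, not the
reward function). We have:

$$\gamma_T(s) = \frac{T - m_1}{T} \cdot \frac{1}{T - m_1} \sum_{t=1}^{T} r(s_t) \ge \frac{T - m_1}{T} \cdot \frac{1}{T - m_1} \sum_{t=m_1+1}^{T} r(s_t).$$

But $T - m_1 \le n_1 + m_2 + ... + n_{i-1} + m_i \le (1 + \alpha) \sum_{j=1}^{i-1} n_j$,
so

$$\gamma_T(s) \ge \frac{T - m_1}{T (1 + \alpha)} (v^* - \alpha).$$

And the right hand-side converges to $(v^* - \alpha)/(1 + \alpha)$ as $T$ goes to
infinity.

Assume now that $T = m_1 + n_1 + ... + m_{i-1} + n_{i-1} + m_i + r$, for some
positive $i$ and $r$ in $\{0, ..., n_i\}$. The previous computation shows that:
$\sum_{t=m_1+1}^{T-r} r(s_t) \ge \frac{T - r - m_1}{1 + \alpha} (v^* - \alpha)$.
Since $\nu_{m_i, n_i}(s^i) \ge v^* - \alpha$, we also have
$\sum_{t=T-r+1}^{T} r(s_t) \ge r \, (v^* - \alpha)$. Consequently:

$$T \gamma_T(s) \ge (T - m_1 - r) \frac{v^* - \alpha}{1 + \alpha} + r \, (v^* - \alpha) = T \frac{v^* - \alpha}{1 + \alpha} - m_1 \frac{v^* - \alpha}{1 + \alpha} + r \, \frac{\alpha (v^* - \alpha)}{1 + \alpha},$$

$$\gamma_T(s) \ge \frac{v^* - \alpha}{1 + \alpha} - \frac{m_1}{T} \cdot \frac{v^* - \alpha}{1 + \alpha}.$$

So we obtain
$\liminf_T \gamma_T(s) \ge (v^* - \alpha)/(1 + \alpha) = v^* - \frac{\alpha (1 + v^*)}{1 + \alpha} \ge v^* - \alpha (1 + v^*)$.
We have proved the existence of an $\alpha(1 + v^*)$-optimal play in $\Gamma(z)$
for every positive $\alpha$, and this concludes the proofs of proposition 4.2
and consequently, of theorem 3.7. □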



Remark 4.3. It is possible to see that properties (7) and (8) imply the uniform
convergence of $(v_n)$ to
$v^*(z) = \sup_m \inf_n w_{m,n}(z) = \sup_m \inf_n v_{m,n}(z)$, and step 5
of the proof. So assuming in theorem 3.7 that (7) and (8) hold, instead of the
precompactness of $W$, still yields all the conclusions of the theorem.

Remark 4.4. The hypothesis "$W$ precompact" is quite strong and is not satisfied
in the following example, which deals with Cesaro convergence of bounded real
sequences. Take $Z$ as the set of positive integers, the transition $F$ simply is
$F(n) = \{n + 1\}$ (hence the system is uncontrolled here). The payoff function
in state $n$ is given by $u_n$, where $(u_n)_n$ is the sequence of 0's and 1's
defined by consecutive blocks: $B^1, B^2, ..., B^k, ...$, where $B^k$ has length
$2k$ and consists of $k$ consecutive 1's then $k$ consecutive 0's. The sequence
$(u_n)_n$ Cesaro-converges to 1/2, hence this is the limit value and the uniform
value. We have $1/2 = \sup_m \inf_n v_{m,n}$, but
$v^* = \inf_n \sup_m v_{m,n} = 1$, and $W$ is not precompact here.
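The example of remark 4.4 is easy to simulate. The sketch below is a numerical illustration, not part of the proof; the function names `u` and `v` and the specific shift used in the last lines are choices made here. It builds the block sequence, checks that the Cesaro averages from state 1 are close to 1/2, and exhibits for each n a shift m for which the average over stages m+1, ..., m+n equals 1.

```python
def u(n):
    """Reward in state n >= 1: block B^k has k ones followed by k zeros."""
    k, start = 1, 1  # start is the index of the first element of block B^k
    while n >= start + 2 * k:
        start += 2 * k
        k += 1
    return 1 if n < start + k else 0


def v(z, m, n):
    """Average reward over stages m+1, ..., m+n of the unique play at z.

    The transition is F(n) = {n + 1}, so the state at stage t is z + t.
    """
    return sum(u(z + m + t) for t in range(1, n + 1)) / n


print([u(n) for n in range(1, 13)])
print(v(1, 0, 10000))

for n in (1, 5, 20):
    m = n * (n + 1) - 1  # from z = 1, stage m + 1 is the first state of block B^(n+1)
    print(n, m, v(1, m, n))
```

The last loop illustrates why the supremum over m of the n-stage averages is 1 for every n: arbitrarily long runs of 1's appear later in the sequence.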
4.2 Proof of theorem 3.10

We start with a lemma, which requires no assumption.

Lemma 4.5. For every state $z$ in $Z$, and $m_0 \ge 0$,

$$\inf_{n \ge 1} \sup_{0 \le m \le m_0} v_{m,n}(z) \le v^-(z) \le v^+(z) \le \inf_{n \ge 1} \sup_{m \ge 0} v_{m,n}(z).$$

Proof: Because of lemma 3.5, we just have to prove here that
$\inf_{n \ge 1} \sup_{m \le m_0} v_{m,n}(z) \le v^-(z)$. Assume for contradiction
that there exist $z$ in $Z$, $m_0 \ge 0$ and $\varepsilon > 0$ such that:
$\forall n \ge 1$, $\exists m \le m_0$, $v_{m,n}(z) \ge v^-(z) + \varepsilon$. Then
for each $n \ge 1$, we have
$(m_0 + n) v_{m_0+n}(z) \ge n (v^-(z) + \varepsilon)$, which gives
$v_{m_0+n}(z) \ge \frac{n}{m_0 + n} (v^-(z) + \varepsilon)$. This is a
contradiction with the definition of $v^-$. □

We now assume that $V$ is precompact, and will prove theorem 3.10. The
proof is made in three elementary steps, the first two being similar to the
proof of theorem 3.7.

Step 1. Viewing $Z$ as a precompact pseudometric space.

Define $d(z, z') = \sup_{n \ge 1} |v_n(z) - v_n(z')|$ for all $z$, $z'$ in $Z$. As in
step 1 of the proof of theorem 3.7, we can use the assumption "$V$ precompact"
to prove the precompactness of the pseudometric space $(Z, d)$. We obtain:

For all $\varepsilon > 0$, there exists a finite subset $C$ of $Z$ s.t.:
$\forall z \in Z$, $\exists c \in C$, $d(z, c) \le \varepsilon$.

In the sequel of subsection 4.2, $Z$ will always be endowed with the
pseudometric $d$. It is plain that every value function $v_n$ is now 1-Lipschitz.

Step 2. Iterating $F$.

We proceed as in step 2 of the proof of theorem 3.7, and define inductively
the sequence of correspondences $(F^n)_n$ from $Z$ to $Z$, by $F^0(z) = \{z\}$ for
every state $z$, and $\forall n \ge 0$, $F^{n+1} = F^n \circ F$. $F^n(z)$ represents
the set of states that the decision maker can reach in $n$ stages from the
initial state $z$. We easily have:

$$\forall m \ge 0, \forall n \ge 1, \forall z \in Z, \quad v_{m,n}(z) = \sup_{z' \in F^m(z)} v_n(z'). \tag{10}$$

We also define, for every initial state $z$: $G^m(z) = \bigcup_{n=0}^{m} F^n(z)$ and
$G^\infty(z) = \bigcup_{n=0}^{\infty} F^n(z)$. The set $G^\infty(z)$ is the set of
states that the decision maker, starting from $z$, can reach in a finite number
of stages. And since $(Z, d)$ is precompact pseudometric, we obtain the
convergence of $G^m(z)$ to $G^\infty(z)$:

$$\forall \varepsilon > 0, \forall z \in Z, \exists m \ge 0, \forall z' \in G^\infty(z), \exists z'' \in G^m(z) \text{ s.t. } d(z', z'') \le \varepsilon. \tag{11}$$

Step 3. Convergence of $(v_n)_n$. Fix an initial state $z$. Because of (10), the
inequalities of lemma 4.5 give: for each $m_0 \ge 0$,

$$\inf_{n \ge 1} \sup_{z' \in G^{m_0}(z)} v_n(z') \le v^-(z) \le v^+(z) \le \inf_{n \ge 1} \sup_{z' \in G^\infty(z)} v_n(z') = v^*(z).$$

To prove the convergence of $(v_n(z))_n$ to $v^*(z)$, it is thus enough to show
that: $\forall \varepsilon > 0$, $\exists m_0$ s.t.
$\inf_{n \ge 1} \sup_{z' \in G^{m_0}(z)} v_n(z') \ge \inf_{n \ge 1} \sup_{z' \in G^\infty(z)} v_n(z') - \varepsilon$.
We will simply use the convergence of $(G^m(z))_m$ to $G^\infty(z)$, and the
equicontinuity of the family $(v_n)_n$.

Fix $\varepsilon > 0$. By (11), one can find $m_0$ such that
$\forall z' \in G^\infty(z)$, $\exists z'' \in G^{m_0}(z)$ s.t.
$d(z', z'') \le \varepsilon$. Fix $n \ge 1$, and consider $z' \in G^\infty(z)$ such
that $v_n(z') \ge \sup_{y \in G^\infty(z)} v_n(y) - \varepsilon$. There exists
$z''$ in $G^{m_0}(z)$ s.t. $d(z', z'') \le \varepsilon$. Since $v_n$ is
1-Lipschitz, we have
$v_n(z'') \ge \sup_{y \in G^\infty(z)} v_n(y) - 2\varepsilon$, hence
$\sup_{y \in G^{m_0}(z)} v_n(y) \ge \sup_{y \in G^\infty(z)} v_n(y) - 2\varepsilon$.
Since this is true for every $n$, it concludes the proof of the convergence of
$(v_n(z))_n$ to $v^*(z)$.

Each $v_n$ is 1-Lipschitz and $Z$ is precompact, hence the convergence of
$(v_n)_n$ to $v^*$ is uniform. This concludes the proof of theorem 3.10. □

5 Comments 

We start with an example. 
Example 5.1. 

This example may be seen as an adaptation to the compact setup of an exam- 
ple of Lehrer and Sorin (1992), and illustrates the importance of condition c) (F 
non expansive) in the hypotheses of corollary 13.91 It also shows that in general 
one may have: sup m > inf n >i w m>n (z) ^ sup m > inf„>i v m>n (z). 

Define the set of states $Z$ as the unit square $[0,1]^2$ plus some isolated point
$z_0$. The transition is given by $F(z_0) = \{(0,y), y \in [0,1]\}$, and for $(x,y)$ in $[0,1]^2$,
$F(x,y) = \{(\min\{1, x+y\}, y)\}$. The initial state being $z_0$, the interpretation is the
following. The decision maker only has one decision to make: he has to choose
at the first stage a point $(0,y)$, with $y \in [0,1]$. Then the play is determined,
and the state evolves horizontally (the second coordinate remains $y$ forever) with
arithmetic progression until it reaches the line $x = 1$. $y$ also represents the speed
chosen by the decision maker: if $y = 0$, then the state will remain $(0,0)$ forever.
If $y > 0$, the state will evolve horizontally with speed $y$ until reaching the point
$(1, y)$.

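[Figure: the unit square $[0,1]^2$, with the ordinate $y$ marked on the vertical axis and the abscissas 1/3 and 2/3 marked on the horizontal axis.]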

Let now the reward function $r$ be such that for every $(x,y) \in [0,1]^2$, $r(x,y) = 1$
if $x \in [1/3, 2/3]$, and $r(x,y) = 0$ if $x \notin [1/4, 3/4]$. The payoff is low when $x$ takes
extreme values, so intuitively the decision maker would like to maximize the
number of stages where the first coordinate of the state is "not too far" from 1/2.

Endow for example $[0,1]^2$ with the distance $d$ induced by the norm $\|\cdot\|_1$ of
$\mathbb{R}^2$, and set $d(z_0, (x,y)) = 1$ for every $x$ and $y$ in $[0,1]$. $(Z,d)$ is a compact metric
space, and $r$ can be extended as a Lipschitz function on $Z$. One can check that
$F$ is 2-Lipschitz, i.e. we have $d(F(z), F(z')) \leq 2d(z,z')$ for each $z$, $z'$.

For each $n \geq 2$, we have $v_n(z_0) \geq 1/2$ because the decision maker can reach
the line $x = 2/3$ in exactly $n$ stages by choosing initially $(0, \frac{2}{3(n-1)})$. But for
each play $s$ at $z_0$, we have $\lim_n \gamma_n(s) = 0$, so $v(z_0) = 0$. The uniform value does
not exist for $\Gamma(z_0)$. This shows the importance of condition c) of corollary 3.9:
although $F$ is very smooth, it is not non expansive. As a byproduct, we obtain
that there is no distance on $Z$ compatible with the Euclidean topology which
makes the correspondence $F$ non expansive.
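The two claims above can be checked numerically. The following is a minimal sketch, not part of the original argument: the names `reward` and `average_payoff` are hypothetical, and the reward is extended linearly on $[1/4,1/3]$ and $[2/3,3/4]$, which is only one possible Lipschitz extension (the text fixes $r$ only on $[1/3,2/3]$ and outside $[1/4,3/4]$).

```python
from fractions import Fraction as Fr

def reward(x):
    """Reward as a function of the first coordinate x.

    Equal to 1 on [1/3, 2/3] and 0 outside [1/4, 3/4], as in the text.
    The linear interpolation in between is an assumption of this sketch.
    """
    if Fr(1, 3) <= x <= Fr(2, 3):
        return Fr(1)
    if x <= Fr(1, 4) or x >= Fr(3, 4):
        return Fr(0)
    if x < Fr(1, 3):
        return (x - Fr(1, 4)) / (Fr(1, 3) - Fr(1, 4))
    return (Fr(3, 4) - x) / (Fr(3, 4) - Fr(2, 3))

def average_payoff(y, n):
    """Average reward over stages 1..n of the play z_1 = (0, y), z_{t+1} = (min(1, x + y), y)."""
    x, total = Fr(0), Fr(0)
    for _ in range(n):
        total += reward(x)
        x = min(Fr(1), x + y)
    return total / n

# Speed 2/(3(n-1)) reaches the line x = 2/3 exactly at stage n.
for n in (2, 5, 20):
    print(n, average_payoff(Fr(2, 3 * (n - 1)), n))
```

For a fixed speed $y > 0$ the same function shows the long-run average payoff shrinking as $n$ grows, which is the reason no single play is good in all long problems.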

We now show that $\sup_{m \geq 0} \inf_{n \geq 1} w_{m,n}(z_0) < \sup_{m \geq 0} \inf_{n \geq 1} v_{m,n}(z_0)$. We have
$\sup_{m \geq 0} \inf_{n \geq 1} v_{m,n}(z_0) = v^-(z_0) \geq 1/2$. Fix now $m \geq 0$, and $\varepsilon > 0$. Take $n$ larger
than $3m/\varepsilon$ and consider a play $s = (z_t)_{t \geq 1}$ in $S(z_0)$ such that $\nu_{m,n}(s) > 0$. By
definition of $\nu_{m,n}$, we have $\gamma_{m,1}(s) > 0$, so the first coordinate of $z_{m+1}$ is in $[1/4, 3/4]$.
If we denote by $y$ the second coordinate of $z_1$, the first coordinate of $z_{m+1}$ is $my$,
so $my \geq 1/4$. But this implies that $4my \geq 1$, so at any stage greater than $4m$
the payoff is zero. Consequently $n\gamma_{m,n}(s) \leq 3m$, and $\gamma_{m,n}(s) \leq \varepsilon$. So $\nu_{m,n}(s) \leq \varepsilon$,
and this holds for any play $s$. So $\sup_{m \geq 0} \inf_{n \geq 1} w_{m,n}(z_0) = 0$.

Example 5.2. 0-optimal strategies may not exist.

The following example shows that 0-optimal strategies may not exist, even
when the assumptions of corollary 3.9 hold, $Z$ is compact and $F$ has compact
values. It is the deterministic adaptation of example 1.4.4. in Sorin (2002).
Define $Z$ as the simplex $\{z = (p^a, p^b, p^c) \in \mathbb{R}^3_+, p^a + p^b + p^c = 1\}$. The payoff
is $r(p^a, p^b, p^c) = p^b - p^c$, and the transition is defined by: $F(p^a, p^b, p^c) = \{((1 -
\alpha - \alpha^2)p^a, p^b + \alpha p^a, p^c + \alpha^2 p^a), \alpha \in [0,1/2]\}$. The initial state is $z_0 = (1,0,0)$.
Notice that along any path, the second coordinate and the third coordinate are
non decreasing.

The probabilistic interpretation is the following: there are 3 points $a$, $b$ and
$c$, and the initial point is $a$. The payoff is 0 at $a$, it is +1 at $b$, and -1 at $c$. At
point $a$, the decision maker has to choose $\alpha \in [0,1/2]$: then $b$ is reached with
probability $\alpha$, $c$ is reached with probability $\alpha^2$, and the play stays in $a$ with the
remaining probability $1 - \alpha - \alpha^2$. When $b$ (resp. $c$) is reached, the play stays at $b$
(resp. $c$) forever. So the decision maker starting at point $a$ wants to reach $b$ and
to avoid $c$.

Back to our deterministic setup, we use the norm $\|\cdot\|_1$ and obtain that $Z$ is compact,
$F$ is non expansive and $r$ is continuous. Applying corollary 3.9 gives the
existence of the uniform value.

Fix $\varepsilon$ in $(0,1/2)$. The decision maker can choose at each stage the same
probability $\varepsilon$, i.e. he can choose at each state $z_t = (p_t^a, p_t^b, p_t^c)$ the next $z_{t+1}$ as
$((1 - \varepsilon - \varepsilon^2)p_t^a, p_t^b + \varepsilon p_t^a, p_t^c + \varepsilon^2 p_t^a)$. This sequence of states $s = (z_t)_t$ converges to
$(0, \frac{1}{1+\varepsilon}, \frac{\varepsilon}{1+\varepsilon})$. So $\liminf_t \gamma_t(s) = \frac{1-\varepsilon}{1+\varepsilon}$. Finally we obtain that the uniform value at
$z_0$ is 1.
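The limit state of the constant-$\varepsilon$ play can be checked by direct iteration. The sketch below is illustrative only: the function name `play_constant` and the number of iterations are assumptions, and floating point arithmetic is used, so the comparison with $(0, \frac{1}{1+\varepsilon}, \frac{\varepsilon}{1+\varepsilon})$ is up to rounding.

```python
def play_constant(eps, steps):
    """Iterate the transition of Example 5.2 with the same choice alpha = eps at every stage.

    Starts from z_0 = (1, 0, 0) and returns the state (p_a, p_b, p_c) after `steps` stages.
    """
    pa, pb, pc = 1.0, 0.0, 0.0
    for _ in range(steps):
        # The right-hand side is evaluated with the old value of pa.
        pa, pb, pc = (1 - eps - eps**2) * pa, pb + eps * pa, pc + eps**2 * pa
    return pa, pb, pc

for eps in (0.25, 0.1, 0.01):
    pa, pb, pc = play_constant(eps, 5000)
    # Stage payoff r = p_b - p_c at the limit state, compared with (1 - eps) / (1 + eps).
    print(eps, pb - pc, (1 - eps) / (1 + eps))
```

The long-run payoff $\frac{1-\varepsilon}{1+\varepsilon}$ increases to 1 as $\varepsilon$ decreases to 0 but never equals 1 for $\varepsilon > 0$, which matches the absence of 0-optimal plays.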

But as soon as the decision maker chooses a positive $\alpha$ at point $a$, he has a
positive probability to be stuck forever with a payoff of -1, so it is clear that no
0-optimal strategy exists here.

Remark 5.3. On stationary $\varepsilon$-optimal plays.

A play $s = (z_t)_{t \geq 1}$ in $S$ is said to be stationary at $z_0$ if there exists a mapping
$f$ from $Z$ to $Z$ such that for every positive $t$, $z_t = f(z_{t-1})$. We give here a positive
and a negative result.

A) When the uniform value exists, $\varepsilon$-optimal plays can always be chosen stationary.

We just assume that $\Gamma(z)$ has a uniform value, and proceed here as in the
proof of theorem 2 in Rosenberg et al., 2002. Fix the initial state $z$. Consider
$\varepsilon > 0$, a play $s = (z_t)_{t \geq 1}$ in $S(z)$, and $T_0$ such that $\forall T \geq T_0$, $\gamma_T(s) \geq v(z) - \varepsilon$.

Case 1: Assume that there exist $t_1$ and $t_2$ such that $z_{t_1} = z_{t_2}$ and the average
payoff between $t_1$ and $t_2$ is good in the sense that: $\gamma_{t_1,t_2}(s) \geq v(z) - 2\varepsilon$. It is
then possible to repeat the cycle between $t_1$ and $t_2$ and obtain the existence of a
stationary ("cyclic") $2\varepsilon$-optimal play in $\Gamma(z)$.

Case 2: Assume that there exists $z'$ in $Z$ such that $\{t \geq 0, z_t = z'\}$ is infinite:
the play goes through $z'$ infinitely often. Then necessarily case 1 holds.

Case 3: Assume finally that case 1 does not hold. For every state $z'$, the play
$s$ goes through $z'$ a finite number of times, and the average payoff between two
stages when $z'$ occurs (whenever these stages exist) is low.

We "shorten" $s$ as much as possible. Set: $y_0 = z_0$, $i_1 = \max\{t \geq 0, z_t = z_0\}$,
$y_1 = z_{i_1+1}$, $i_2 = \max\{t \geq 0, z_t = y_1\}$, and by induction for each $k$, $y_k = z_{i_k+1}$ and
$i_{k+1} = \max\{t \geq 0, z_t = y_k\}$, so that $z_{i_{k+1}} = y_k = z_{i_k+1}$. The play $s' = (y_t)_{t \geq 0}$ can
be played at $z$. Since all $y_t$ are distinct, it is a stationary play at $z$. Regarding
payoffs, going from $s$ to $s'$ we removed average payoffs of the type $\gamma_{t_1,t_2}(s)$, where
$z_{t_1} = z_{t_2}$. Since we are not in case 1, each of these payoffs is less than $v(z) - 2\varepsilon$,
so going from $s$ to $s'$ we increased the average payoffs and we have: $\forall T \geq T_0$,
$\gamma_T(s') \geq v(z) - \varepsilon$. $s'$ is an $\varepsilon$-optimal play at $z$, and this concludes the proof of A).
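The shortening step of Case 3 can be illustrated on a finite prefix of a play. This is only a sketch under a simplifying assumption: the proof works with an infinite play in which every state is visited finitely often, whereas the hypothetical helper `shorten` below treats a finite list and uses the last visit within that list.

```python
def shorten(play):
    """Remove all cycles from a finite play by jumping to the last visit of each state.

    play[0] is the initial state z_0. The result is y_0, y_1, ... with
    y_0 = z_0 and y_{k+1} = play[i + 1], where i is the last time y_k occurs.
    """
    last_visit = {state: t for t, state in enumerate(play)}
    shortened = [play[0]]
    t = last_visit[play[0]]
    while t + 1 < len(play):
        y = play[t + 1]
        shortened.append(y)
        t = last_visit[y]  # always >= t + 1, so the loop terminates
    return shortened

print(shorten(["z0", "b", "c", "b", "d", "z0", "e", "f", "e", "g"]))
```

All states of the shortened sequence are distinct, and each consecutive pair is a transition that already occurs in the input play, so the map sending each $y_k$ to $y_{k+1}$ is well defined. This is what makes the shortened play stationary.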

Notice that we did not obtain the existence of a mapping $f$ from $Z$ to $Z$
such that for every initial state $z$, the play $(f^t(z))_{t \geq 1}$ (where $f^t$ is $f$ iterated $t$
times) is $\varepsilon$-optimal at $z$. In our proof, the mapping $f$ depends on the initial state.

B) Continuous stationary strategies which are $\varepsilon$-optimal for each initial state may
not exist.

Assume that the hypotheses of corollary 3.9 are satisfied. Assume also that
$Z$ is a subset of a Banach space and $F$ has closed and convex values, so that $F$
admits a continuous selection (by Michael's theorem). The uniform value exists,
and by A) we know that $\varepsilon$-optimal plays can be chosen to be stationary. So if we
fix an initial state $z$, we can find a mapping $f$ from $Z$ to $Z$ such that the play
$(f^t(z))_{t \geq 1}$ is $\varepsilon$-optimal at $z$. Can $f$ be chosen as a continuous selection of $\Gamma$?

A stronger result would be the existence of a continuous $f$ such that for every
initial state $z$, the play $(f^t(z))_{t \geq 1}$ is $\varepsilon$-optimal at $z$. However this existence is not
guaranteed, as the following example shows. Define $Z = [-1,1] \cup [2,3]$, with the
usual distance. Set $F(z) = [2, z+3]$ if $z \in [-1,0]$, $F(z) = [z+2, 3]$ if $z \in [0,1]$,
and $F(z) = \{z\}$ if $z \in [2,3]$. Consider the payoff $r(z) = |z - 5/2|$ for each $z$.



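[Figure: graph of the correspondence $F$ on $Z = [-1,1] \cup [2,3]$; horizontal axis marked at -1, 0, 1, 2, 3.]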

The hypotheses of corollary 3.9 are satisfied. The states in $[2,3]$ correspond to
final ("absorbing") states, and $v(z) = |z - 5/2|$ if $z \in [2,3]$. If the initial state $z$ is
in $[-1,1]$, one can always choose the final state to be 2 or 3, so that $v(z) = 1/2$.
Take now any continuous selection $f$ of $\Gamma$. Necessarily $f(-1) = 2$ and $f(1) = 3$,
so there exists $z$ in $(-1,1)$ such that $f(z) = 5/2$. But then the play $s = (f^t(z))_{t \geq 1}$
gives a null payoff at every stage, and for $\varepsilon \in (0,1/2)$ is not $\varepsilon$-optimal at $z$.
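To make the argument concrete, here is a small sketch with one particular continuous selection, $f(z) = (z+5)/2$ on $[-1,1]$. This choice of $f$ is an assumption made for illustration (the text argues about an arbitrary continuous selection), and the function names are hypothetical.

```python
def F_bounds(z):
    """Endpoints (lo, hi) of the interval F(z) from the example."""
    if -1 <= z <= 0:
        return 2.0, z + 3.0
    if 0 < z <= 1:
        return z + 2.0, 3.0
    return z, z  # absorbing states in [2, 3]

def f(z):
    """One continuous selection of F (illustrative choice, not from the text)."""
    return (z + 5.0) / 2.0 if -1 <= z <= 1 else z

def r(z):
    """Stage payoff."""
    return abs(z - 2.5)

def stationary_play(z, stages):
    """States f(z), f(f(z)), ... generated by the stationary strategy f."""
    states = []
    for _ in range(stages):
        z = f(z)
        states.append(z)
    return states

# From z = 0 the selection lands on 5/2 and stays there with payoff 0,
# while moving to 2 or 3 would give 1/2 at every stage.
print([r(s) for s in stationary_play(0.0, 5)], max(r(2.0), r(3.0)))
```

With this selection the bad initial state is $z = 0$. For another continuous selection the bad state may differ, but the intermediate value theorem guarantees that one exists.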



Remark 2.6 continued. Discounted payoffs, proofs.

We prove here the results announced in remark 2.6 about discounted payoffs.
Proceeding similarly as in definition 2.3 and claim 2.4, we say that $\Gamma(z)$ has a
d-uniform value if: $(v_\lambda(z))_\lambda$ has a limit $v(z)$ when $\lambda$ goes to zero, and for every
$\varepsilon > 0$, there exists a play $s$ at $z$ such that $\liminf_{\lambda \to 0} \gamma_\lambda(s) \geq v(z) - \varepsilon$. Whereas the
definition of uniform value fits Cesaro summations, the definition of d-uniform
value fits Abel summations.

Given a sequence $(a_t)_{t \geq 1}$ of nonnegative real numbers, we denote for each
$n \geq 1$ and $\lambda \in (0,1]$, by $\bar{a}_n$ the Cesaro mean $\frac{1}{n} \sum_{t=1}^{n} a_t$, and by $\bar{a}_\lambda$ the Abel mean
$\sum_{t=1}^{\infty} \lambda(1-\lambda)^{t-1} a_t$. We have the following Abelian theorem (see e.g. Lippman
1969, or Sznajder and Filar, 1992):

$$\limsup_{n \to \infty} \bar{a}_n \geq \limsup_{\lambda \to 0} \bar{a}_\lambda \geq \liminf_{\lambda \to 0} \bar{a}_\lambda \geq \liminf_{n \to \infty} \bar{a}_n.$$

And the convergence of $\bar{a}_\lambda$, as $\lambda$ goes to zero, implies the convergence of $\bar{a}_n$, as
$n$ goes to infinity, to the same limit (Hardy and Littlewood Theorem, see e.g.
Lippman 1969).
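The two means can be compared numerically on a simple sequence. The sketch below is illustrative: the names `cesaro_mean` and `abel_mean` are hypothetical, and the Abel series is truncated at a finite horizon, which is an approximation whose error is at most $(1-\lambda)^{\text{horizon}}$ for a sequence with values in $[0,1]$.

```python
def cesaro_mean(a, n):
    """(1/n) * sum_{t=1..n} a(t), where a is a function of the stage t >= 1."""
    return sum(a(t) for t in range(1, n + 1)) / n

def abel_mean(a, lam, horizon):
    """Truncated Abel mean: sum_{t=1..horizon} lam * (1 - lam)**(t - 1) * a(t)."""
    return sum(lam * (1 - lam) ** (t - 1) * a(t) for t in range(1, horizon + 1))

def alternating(t):
    """Example sequence: 0 at odd stages, 1 at even stages."""
    return 1.0 if t % 2 == 0 else 0.0

# Both means approach 1/2; for this sequence the Abel mean equals (1 - lam) / (2 - lam).
print(cesaro_mean(alternating, 10001))
print(abel_mean(alternating, 0.01, 5000), 0.99 / 1.99)
```

For this sequence both summation methods give the limit 1/2. The sequences of interest for the counter-example below are those where the Cesaro means oscillate, so that the outer inequalities of the Abelian theorem are strict.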

Lemma 5.4. If $\Gamma(z)$ has a uniform value $v(z)$, then $\Gamma(z)$ has a d-uniform value
which is also $v(z)$.

Proof: Assume that $\Gamma(z)$ has a uniform value $v(z)$. Then for every $\varepsilon > 0$, there
exists a play $s$ at $z$ such that $\liminf_{\lambda \to 0} \gamma_\lambda(s) \geq \liminf_{n \to \infty} \gamma_n(s) \geq v(z) - \varepsilon$. So
$\liminf_{\lambda \to 0} v_\lambda(z) \geq v(z)$. But one always has $\limsup_n v_n(z) \geq \limsup_\lambda v_\lambda(z)$ (Lehrer
Sorin 1992). So $v_\lambda(z) \to_{\lambda \to 0} v(z)$, and there is a d-uniform value. □

We now give a counter-example to the converse of lemma 5.4. Liggett and
Lippman, 1969, showed how to construct a sequence $(a_t)_{t \geq 1}$ with values in $\{0,1\}$
such that $a^* := \limsup_{\lambda \to 0} \bar{a}_\lambda < \limsup_{n \to \infty} \bar{a}_n$. Let us define $Z = \mathbb{N}$ and $z_0 = 0$.
The transition satisfies: $F(0) = \{0, 1\}$, and $F(t) = \{t+1\}$ is a singleton for each
positive $t$. The reward function is defined by $r(0) = a^*$, and for each $t \geq 1$,
$r(t) = a_t$. A play in $S(z_0)$ can be identified with the number of positive stages
spent in state 0: there is the play $s(\infty)$ which always remains in state 0, and
for each $k \geq 0$ the play $s(k) = (s_t(k))_{t \geq 1}$ which leaves state 0 after stage $k$, i.e.
$s_t(k) = 0$ for $t \leq k$, and $s_t(k) = t - k$ otherwise.

For every $\lambda$ in $(0,1]$, $\gamma_\lambda(s(\infty)) = a^*$, $\gamma_\lambda(s(0)) = \bar{a}_\lambda$, and for each $k$, $\gamma_\lambda(s(k))$ is
a convex combination between $\gamma_\lambda(s(\infty))$ and $\gamma_\lambda(s(0))$, so $v_\lambda(z_0) = \max\{a^*, \bar{a}_\lambda\}$.
So $v_\lambda(z_0)$ converges to $a^*$ as $\lambda$ goes to zero. Since $s(\infty)$ guarantees $a^*$ in every
game, $\Gamma(z_0)$ has a d-uniform value.

For each $n \geq 1$, $v_n(z_0) \geq \gamma_n(s(0)) = \bar{a}_n$, so $\limsup_n v_n(z_0) \geq \limsup_{n \to \infty} \bar{a}_n$.
But for every play $s$ at $z_0$, $\liminf_n \gamma_n(s) \leq \max\{a^*, \liminf_n \bar{a}_n\} = a^*$. The
decision maker can guarantee nothing more than $a^*$, so he can not guarantee
$\limsup_n v_n(z_0)$, and $\Gamma(z_0)$ has no uniform value.

6 Applications to Markov decision processes 

We start with a simple case. 

6.1 MDPs with a finite set of states. 

Consider a finite set of states $K$, with an initial probability $p_0$ on $K$, a non empty
set of actions $A$, a transition function $q$ from $K \times A$ to the set $\Delta(K)$ of probability
distributions on $K$, and a reward function $g$ from $K \times A$ to $[0,1]$.

This MDP is played as follows. An initial state $k_1$ in $K$ is selected according to
$p_0$ and told to the decision maker, then he selects $a_1$ in $A$ and receives a payoff of
$g(k_1, a_1)$. A new state $k_2$ is selected according to $q(k_1, a_1)$ and told to the decision
maker, etc. A strategy of the decision maker is then a sequence $\sigma = (\sigma_t)_{t \geq 1}$,
where for each $t$, $\sigma_t : (K \times A)^{t-1} \times K \to A$ defines the action to be played at
stage $t$. Considering expected average payoffs in the first $n$ stages, the definition
of the n-stage value $v_n(p_0)$ naturally adapts to this case. And the notions of limit
value and uniform value also adapt here. Write $\Psi(p_0)$ for this MDP.

1 We proceed similarly as in Flynn (1974), who showed that a Blackwell optimal play need
not be optimal with respect to "Derman's average cost criterion".

We define an auxiliary (deterministic) dynamic programming problem $\Gamma(z_0)$.
We view $\Delta(K)$ as the set of vectors $p = (p^k)_k$ in $\mathbb{R}^K_+$ such that $\sum_k p^k = 1$. We
introduce:

• a new set of states $Z = \Delta(K) \times [0,1]$,

• a new initial state $z_0 = (p_0, 0)$,

• a new payoff function $r : Z \to [0,1]$ such that $r(p,y) = y$ for all
$(p,y)$ in $Z$,

• a transition correspondence $F$ from $Z$ to $Z$ such that for every
$z = (p,y)$ in $Z$,

$$F(z) = \left\{ \left( \sum_{k \in K} p^k q(k, a_k), \sum_{k \in K} p^k g(k, a_k) \right), a_k \in A \ \forall k \in K \right\}.$$

Notice that $F((p,y))$ does not depend on $y$, hence the value functions in $\Gamma(z)$
only depend on the first component of $z$. It is easy to see that the value functions
of $\Gamma$ and $\Psi$ are linked as follows: $\forall z = (p,y) \in Z$, $\forall n \geq 1$, $v_n(z) = v_n(p)$.
Moreover, anything that can be guaranteed by the decision maker in $\Gamma(p,0)$ can
also be guaranteed in $\Psi(p)$. So if we prove that the auxiliary problem $\Gamma(p_0,0)$
has a uniform value, then $(v_n(p_0))_n$ has a limit that can be guaranteed, up to
every $\varepsilon > 0$, in $\Gamma(p_0,0)$, hence also in $\Psi(p_0)$. And we obtain the existence of the
uniform value for $\Psi(p_0)$.
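The link $v_n(z) = v_n(p)$ can be checked on a toy instance. The sketch below is a minimal illustration under stated assumptions: the two-state, two-action data `g` and `q` are invented for the example, the action set is taken finite so that $F$ can be enumerated, and the function names are hypothetical. Values are computed as totals over $n$ stages (that is, $n$ times the average).

```python
from fractions import Fraction as Fr
from itertools import product

K = [0, 1]          # states (toy example)
A = [0, 1]          # actions (finite here only so that F can be enumerated)
g = {(0, 0): Fr(0), (0, 1): Fr(1, 2), (1, 0): Fr(1), (1, 1): Fr(1, 4)}
q = {(0, 0): [Fr(1, 2), Fr(1, 2)], (0, 1): [Fr(1), Fr(0)],
     (1, 0): [Fr(0), Fr(1)], (1, 1): [Fr(1, 3), Fr(2, 3)]}

def F(p):
    """Successors (p', y') of a state (p, y) of the auxiliary problem: one per map k -> a_k."""
    successors = []
    for choice in product(A, repeat=len(K)):
        new_p = tuple(sum(p[k] * q[k, choice[k]][j] for k in K) for j in K)
        y = sum(p[k] * g[k, choice[k]] for k in K)
        successors.append((new_p, y))
    return successors

def total_auxiliary(p, n):
    """n * v_n((p, y)) in the deterministic problem, by backward recursion."""
    if n == 0:
        return Fr(0)
    return max(y + total_auxiliary(new_p, n - 1) for new_p, y in F(p))

def total_mdp(k, n):
    """n * (n-stage value of the MDP started at the observed state k)."""
    if n == 0:
        return Fr(0)
    return max(g[k, a] + sum(q[k, a][j] * total_mdp(j, n - 1) for j in K) for a in A)

p0 = (Fr(1, 3), Fr(2, 3))
for n in range(1, 5):
    print(n, total_auxiliary(p0, n), sum(p0[k] * total_mdp(k, n) for k in K))
```

The two columns agree because choosing one action $a_k$ per state $k$ in the auxiliary problem is the same as choosing the action after observing the state in the MDP.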

It is convenient to set $d((p,y), (p',y')) = \max\{\|p - p'\|_1, |y - y'|\}$. $Z$ is compact
and $r$ is continuous. $F$ may have non compact values, but is non expansive so
that we can apply corollary 3.9. Consequently, for each $p_0$, $\Psi(p_0)$ has a uniform
value, and we have obtained the following result.

Theorem 6.1. Any MDP with finite set of states has a uniform value. 

We could not find theorem 6.1 in the literature. The case where $A$ is finite is
well known since the seminal work of Blackwell (1962), who showed the existence
of Blackwell optimal plays. If $A$ is compact and both $q$ and $g$ are continuous in
$a$, the uniform value was known to exist (see Dynkin Yushkevich, 1979, or Sorin,
2002, Corollary 5.26). In this case, more properties on ($\varepsilon$-)optimal strategies have
been obtained.

6.2 MDPs with partial observation. 

We now consider a more general model where after each stage, the decision maker
does not perfectly observe the state. We still have a finite set of states $K$, an
initial probability $p_0$ on $K$, a non empty set of actions $A$, but we also have a non
empty set of signals $S$. The transition $q$ now goes from $K \times A$ to $\Delta_f(S \times K)$, the
set of probabilities with finite support on $S \times K$, and the reward function $g$ still
goes from $K \times A$ to $[0,1]$.

This MDP $\Psi(p_0)$ is played by a decision maker knowing $K$, $p_0$, $A$, $S$, $q$ and $g$
and the following description. An initial state $k_1$ in $K$ is selected according to $p_0$
and is not told to the decision maker. At every stage $t$ the decision maker selects
an action $a_t \in A$, and has an (unobserved) payoff $g(k_t, a_t)$. Then a pair $(s_t, k_{t+1})$
is selected according to $q(k_t, a_t)$, and $s_t$ is told to the decision maker. The new
state is $k_{t+1}$, and the play goes to stage $t + 1$.

The existence of the uniform value was proved in Rosenberg et al. in the
case where $A$ and $S$ are finite sets (see footnote 2). We show here how to apply corollary 3.8 to
this setup, and generalize the mentioned result of Rosenberg et al. to the case of
arbitrary sets of actions and signals.

A pure strategy of the decision maker is then a sequence $\sigma = (\sigma_t)_{t \geq 1}$, where
for each $t$, $\sigma_t : (A \times S)^{t-1} \to A$ defines the action to be played at stage $t$. More
general strategies are behavioral strategies, which are sequences $\sigma = (\sigma_t)_{t \geq 1}$,
where for each $t$, $\sigma_t : (A \times S)^{t-1} \to \Delta_f(A)$ and $\Delta_f(A)$ is the set of probabilities
with finite support on $A$. In $\Psi(p_0)$ we assume that players use behavior strategies.
Any strategy induces, together with $p_0$, a probability distribution over $(K \times A \times
S)^\infty$, and we can define expected average payoffs and n-stage values $v_n(p_0)$. These
n-stage values can be obtained with pure strategies. However, one has to be
careful when dealing with an infinite number of stages: in general it may not be
true that something which can be guaranteed by the decision maker in $\Psi(p_0)$,
i.e. with behavior strategies, can also be guaranteed by the decision maker with
pure strategies. We will prove here the existence of the uniform value in $\Psi(p_0)$,
and thus obtain:

Theorem 6.2. If the set of states is finite, an MDP with partial observation,
played with behavioral strategies, has a uniform value.

Proof: As in the previous model, we view $\Delta(K)$ as the set of vectors $p = (p^k)_k$ in
$\mathbb{R}^K_+$ such that $\sum_k p^k = 1$. We write $X = \Delta(K)$, and use $\|\cdot\|_1$ on $X$. Assume that
the state of some stage has been selected according to $p$ in $X$ and the decision
maker plays some action $a$ in $A$. This defines a probability on the future belief of
the decision maker on the state of the next stage. It is a probability with finite
support because we have a belief in $X$ for each possible signal in $S$, and we denote
this probability on $X$ by $q(p,a)$. To introduce a deterministic problem we need
a larger space than $X$.

2 These authors also considered the case of a compact action set, with some continuity on $g$
and $q$, see comment 5 p. 1192.

We define $\Delta(X)$ as the set of Borel probabilities over $X$, and endow $\Delta(X)$ with
the weak-* topology. $\Delta(X)$ is now compact and the set $\Delta_f(X)$ of probabilities
on $X$ with finite support is a dense subset of $\Delta(X)$. Moreover, the topology on
$\Delta(X)$ can be metrized by the (Fortet-Mourier-)Wasserstein distance, defined by:

$$\forall u \in \Delta(X), \forall v \in \Delta(X), \quad d(u,v) = \sup_{f \in E_1} |u(f) - v(f)|,$$

where $E_1$ is the set of 1-Lipschitz functions from $X$ to $\mathbb{R}$, and $u(f) = \int_{p \in X} f(p)\,du(p)$.
One can check that this distance also has the following nice properties (see footnote 3):

1) For $p$ and $q$ in $X$, the distance between the Dirac measures $\delta_p$ and $\delta_q$ is
$\|p - q\|_1$.

2) For every continuous mapping $f$ from $X$ to the reals, let us denote by $\tilde{f}$ the
affine extension of $f$ to $\Delta(X)$. We have $\tilde{f}(u) = u(f)$ for each $u$. Then for each
$C \geq 0$, we obtain the equivalence: $f$ is $C$-Lipschitz if and only if $\tilde{f}$ is $C$-Lipschitz.

We will need to consider a whole class of value functions. Let $\theta = \sum_{t \geq 1} \theta_t \delta_t$
be in $\Delta_f(\mathbb{N}^*)$, i.e. $\theta$ is a probability with finite support over positive integers.
For $p$ in $X$ and any behavior strategy $\sigma$, we define the payoff: $\gamma^p_{[\theta]}(\sigma) =
E_{p,\sigma}\left(\sum_{t=1}^{\infty} \theta_t g(k_t, a_t)\right)$, and the value: $v_{[\theta]}(p) = \sup_\sigma \gamma^p_{[\theta]}(\sigma)$. If $\theta = 1/n \sum_{t=1}^{n} \delta_t$,
$v_{[\theta]}(p)$ is nothing but $v_n(p)$. $v_{[\theta]}$ is a 1-Lipschitz function so its affine extension
$\tilde{v}_{[\theta]}$ also is. A standard recursive formula can be written: if we write $\theta^+$ for the
law of $t^* - 1$ given that $t^*$ (selected according to $\theta$) is greater than 1, we get for
each $\theta$ and $p$: $v_{[\theta]}(p) = \sup_{a \in A} \left(\theta_1 \sum_k p^k g(k,a) + (1 - \theta_1)\tilde{v}_{[\theta^+]}(q(p,a))\right)$.

We now define a deterministic problem $\Gamma(z_0)$. An element $u$ in $\Delta_f(X)$ is
written $u = \sum_{p \in X} u(p)\delta_p$, and similarly an element $v$ in $\Delta_f(A)$ is written $v =
\sum_{a \in A} v(a)\delta_a$. Notice that if $p \neq q$, then $1/2\,\delta_p + 1/2\,\delta_q$ is different from $\delta_{1/2\,p + 1/2\,q}$.
We introduce:

• a new set of states $Z = \Delta_f(X) \times [0,1]$,

• a new initial state $z_0 = (\delta_{p_0}, 0)$,

• a new payoff function $r : Z \to [0,1]$ such that $r(u,y) = y$ for all
$(u,y)$ in $Z$,

• a transition correspondence $F$ from $Z$ to $Z$ such that for every
$z = (u,y)$ in $Z$:

$$F(z) = \{(H(u,f), R(u,f)), f : X \to \Delta_f(A)\},$$

where $H(u,f) = \sum_{p \in X} u(p) \left(\sum_{a \in A} f(p)(a)\, q(p,a)\right) \in \Delta_f(X)$,
and $R(u,f) = \sum_{p \in X} u(p) \left(\sum_{k \in K, a \in A} p^k f(p)(a)\, g(k,a)\right)$.

3 Notice that if $d(k,k') = 2$ for any distinct states in $K$, then $\sup_{f : K \to \mathbb{R},\ 1\text{-Lip}} |\sum_k p^k f(k) -
\sum_k q^k f(k)| = \|p - q\|_1$ for every $p$ and $q$ in $\Delta(K)$.

$\Gamma(z_0)$ is a well defined dynamic programming problem. $F(u,y)$ does not depend
on $y$, so the value functions in $\Gamma(z)$ only depend on the first coordinate
of $z$. For every $\theta = \sum_{t \geq 1} \theta_t \delta_t$ in $\Delta_f(\mathbb{N}^*)$ and play $s = (z_t)_{t \geq 1}$, we define
the payoff $\gamma_{[\theta]}(s) = \sum_{t=1}^{\infty} \theta_t r(z_t)$, and the value: $v_{[\theta]}(z) = \sup_{s \in S(z)} \gamma_{[\theta]}(s)$.
If $\theta = 1/n \sum_{t=m+1}^{m+n} \delta_t$, $\gamma_{[\theta]}(s)$ is nothing but $\gamma_{m,n}(s)$, and $v_{[\theta]}(z)$ is nothing
but $v_{m,n}(z)$, see definitions 3.1 and 3.2. $\gamma_{[t]}(s)$ is just the payoff of stage $t$,
i.e. $r(z_t)$. The recursive formula now is: $v_{[\theta]}((u,y)) = \sup_{f : X \to \Delta_f(A)} \left(\theta_1 R(u,f)
+ (1 - \theta_1) v_{[\theta^+]}(H(u,f), 0)\right)$, and the supremum can be taken on deterministic
mappings $f : X \to A$. Consequently, the value functions are linked as follows:
$\forall z = (u,y) \in Z$, $v_{[\theta]}(z) = \tilde{v}_{[\theta]}(u)$. Moreover, anything which can be guaranteed by
the decision maker in $\Gamma(z_0)$ can be guaranteed in the original MDP $\Psi(p_0)$. So the
existence of the uniform value in $\Gamma(z_0)$ will imply the existence of the uniform
value in $\Psi(p_0)$.

We set $d((u,y), (u',y')) = \max\{d(u,u'), |y - y'|\}$. Since $\Delta_f(X)$ is dense in
$\Delta(X)$ for the Wasserstein distance, $Z$ is a precompact metric space. By corollary
3.8, if we show that the family $(w_{m,n})_{m \geq 0, n \geq 1}$ is uniformly equicontinuous, we will
be done. Notice already that since $\tilde{v}_{[\theta]}$ is a 1-Lipschitz function of $u$, $v_{[\theta]}$ is a
1-Lipschitz function of $z$.

Fix now $z$ in $Z$, $m \geq 0$ and $n \geq 1$. We define an auxiliary zero-sum game
$\mathcal{A}(m,n,z)$ as follows: player 1's strategy set is $S(z)$, player 2's strategy set is
$\Delta(\{1,...,n\})$, and the payoff for player 1 is given by: $l(s,\theta) = \sum_{t=1}^{n} \theta_t \gamma_{m,t}(s)$. We
will apply a minmax theorem to $\mathcal{A}(m,n,z)$, in order to obtain: $\sup_s \inf_\theta l(s,\theta) =
\inf_\theta \sup_s l(s,\theta)$. We can already notice that $\sup_s \inf_\theta l(s,\theta) = \sup_{s \in S(z)} \inf_{t \in \{1,...,n\}}
\gamma_{m,t}(s) = w_{m,n}(z)$. $\Delta(\{1,...,n\})$ is convex compact and $l$ is affine continuous in
$\theta$. We will show that $S(z)$ is a convex subset of $Z^\infty$, and first prove that $F$ is an
affine correspondence.

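Lemma 6.3. For every $z'$ and $z''$ in $Z$, and $\lambda \in [0,1]$, $F(\lambda z' + (1 - \lambda)z'') =
\lambda F(z') + (1 - \lambda)F(z'')$.

Proof: Write $z' = (u',y')$, $z'' = (u'',y'')$ and $z = (u,y) = \lambda z' + (1-\lambda)z''$. We have
$u(p) = \lambda u'(p) + (1 - \lambda)u''(p)$ for each $p$. It is easy to see that $F(z) \subset \lambda F(z') + (1 -
\lambda)F(z'')$, so we just prove the reverse inclusion. Let $z'_1 = (H(u',f'), R(u',f'))$ be
in $F(z')$ and $z''_1 = (H(u'',f''), R(u'',f''))$ be in $F(z'')$, with $f'$ and $f''$ mappings
from $X$ to $\Delta_f(A)$. Using here the convexity of $\Delta_f(A)$, we simply define for each
$p$ in $X$ such that $u(p) > 0$, $f(p) = \frac{\lambda u'(p)}{u(p)} f'(p) + \frac{(1-\lambda)u''(p)}{u(p)} f''(p)$ ($f(p)$ being arbitrary
when $u(p) = 0$). We have for each such $p$, $R(\delta_p, f) =
\frac{\lambda u'(p)}{u(p)} R(\delta_p, f') + \frac{(1-\lambda)u''(p)}{u(p)} R(\delta_p, f'')$. So $R(u,f) = \lambda R(u',f') + (1 - \lambda)R(u'',f'')$.
Similarly the transitions satisfy: $H(u,f) = \lambda H(u',f') + (1 - \lambda)H(u'',f'')$. And
we obtain that $\lambda z'_1 + (1 - \lambda)z''_1 = (H(u,f), R(u,f)) \in F(z)$. □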
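The mixing construction in the proof can be checked on a small numerical instance. The sketch below is an illustration under stated assumptions: the rewards `g`, the beliefs, the measures and the strategies are invented toy data, the helper names are hypothetical, and only the identity for $R$ is verified (the one for $H$ follows the same pattern).

```python
from fractions import Fraction as Fr

STATES = (0, 1)
ACTIONS = ("a", "b")
# Toy rewards g(k, action); invented for this example.
g = {(0, "a"): Fr(1), (0, "b"): Fr(0), (1, "a"): Fr(1, 3), (1, "b"): Fr(1)}

def R(u, f):
    """Expected stage payoff: sum_p u(p) * sum_{k,a} p^k f(p)(a) g(k,a)."""
    return sum(w * sum(p[k] * f[p][a] * g[k, a] for k in STATES for a in ACTIONS)
               for p, w in u.items())

def mix(lam, u1, f1, u2, f2):
    """Return (u, f) with u = lam*u1 + (1-lam)*u2 and f(p) the weighted mixture of f1(p), f2(p)."""
    u, f = {}, {}
    for p in set(u1) | set(u2):
        w1, w2 = lam * u1.get(p, 0), (1 - lam) * u2.get(p, 0)
        if w1 + w2 == 0:
            continue  # p carries no mass: f(p) is irrelevant
        u[p] = w1 + w2
        f[p] = {a: (w1 * f1.get(p, {}).get(a, 0) + w2 * f2.get(p, {}).get(a, 0)) / u[p]
                for a in ACTIONS}
    return u, f

p1, p2, p3 = (Fr(1), Fr(0)), (Fr(1, 2), Fr(1, 2)), (Fr(0), Fr(1))
u1 = {p1: Fr(1, 2), p2: Fr(1, 2)}
u2 = {p2: Fr(1, 4), p3: Fr(3, 4)}
f1 = {p1: {"a": Fr(1), "b": Fr(0)}, p2: {"a": Fr(1, 2), "b": Fr(1, 2)}}
f2 = {p2: {"a": Fr(0), "b": Fr(1)}, p3: {"a": Fr(1, 3), "b": Fr(2, 3)}}

lam = Fr(2, 5)
u, f = mix(lam, u1, f1, u2, f2)
print(R(u, f), lam * R(u1, f1) + (1 - lam) * R(u2, f2))
```

The weights $\lambda u'(p)/u(p)$ and $(1-\lambda)u''(p)/u(p)$ sum to one at each belief $p$ with positive mass, so the mixed $f(p)$ is again a probability on actions.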

As a consequence, the graph of F is convex, and this implies the convexity of 
the sets of plays. So we have obtained the following result. 

Corollary 6.4. The set of plays $S(z)$ is a convex subset of $Z^\infty$.

Looking at the definition of the payoff function $r$, we now obtain that $l$ is affine
in $s$. Consequently, we can apply a standard minmax theorem (see e.g. Sorin
2002 proposition A8 p. 157) to obtain the existence of the value in $\mathcal{A}(m,n,z)$. So
$w_{m,n}(z) = \inf_{\theta \in \Delta(\{1,...,n\})} \sup_{s \in S(z)} \sum_{t=1}^{n} \theta_t \gamma_{m,t}(s)$. But $\sup_{s \in S(z)} \sum_{t=1}^{n} \theta_t \gamma_{m,t}(s)$ is
equal to $v_{[\theta^{m,n}]}(z)$, where $\theta^{m,n}$ is the probability on $\{1,...,m+n\}$ such that $\theta^{m,n}_s = 0$
if $s \leq m$, and $\theta^{m,n}_s = \sum_{i=s-m}^{n} \frac{\theta_i}{i}$ if $m < s \leq n + m$. The precise value of $\theta^{m,n}$ does
not matter much, but the point is to write: $w_{m,n}(z) = \inf_{\theta \in \Delta(\{1,...,n\})} v_{[\theta^{m,n}]}(z)$. So
$w_{m,n}$ is 1-Lipschitz as an infimum of 1-Lipschitz mappings. The family $(w_{m,n})_{m,n}$
is uniformly equicontinuous, and the proof of theorem 6.2 is complete. □
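The stage weights $\theta^{m,n}$ can be checked against the definition of $l(s,\theta)$ on arbitrary data. The sketch below is illustrative: the reward sequence and the weights are invented, and the helper names `gamma` and `theta_mn` are hypothetical.

```python
from fractions import Fraction as Fr

def gamma(rewards, m, t):
    """Average reward over stages m+1..m+t; rewards[j-1] is the payoff of stage j."""
    return sum(rewards[m:m + t], Fr(0)) / t

def theta_mn(theta, m):
    """Stage weights on {1, ..., m+n} induced by theta on {1, ..., n}.

    theta[i-1] holds theta_i. Stage s <= m gets weight 0, and stage s in
    (m, m+n] gets sum_{i=s-m..n} theta_i / i.
    """
    n = len(theta)
    return [Fr(0)] * m + [sum(theta[i - 1] / i for i in range(s - m, n + 1))
                          for s in range(m + 1, m + n + 1)]

rewards = [Fr(1, 2), Fr(0), Fr(1), Fr(1, 3), Fr(3, 4), Fr(1, 5)]  # toy stage payoffs
theta = [Fr(1, 2), Fr(0), Fr(1, 3), Fr(1, 6)]                     # probability on {1, 2, 3, 4}
m = 2

lhs = sum(theta[t - 1] * gamma(rewards, m, t) for t in range(1, len(theta) + 1))
weights = theta_mn(theta, m)
rhs = sum(w * x for w, x in zip(weights, rewards))
print(lhs, rhs, sum(weights))
```

The identity is a change in the order of summation: stage $s$ appears in $\gamma_{m,t}(s)$ with weight $1/t$ for every $t \geq s - m$.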

Remark 6.5. The following question, mentioned in Rosenberg et al., is still open.
Do there exist pure $\varepsilon$-optimal strategies?